Visual Tools for explaining Machine Learning Models
Abstract
Statistical and data science models are considered to be, somewhat
pejoratively, black-boxes, interpretation of which has not been
systematically studied.
Molnar’s “Interpretable Machine Learning” is a big effort in finding
solutions. Our presentation is humbler. We aim at presenting visual tools
for model interpretation based on partial dependency plots and their
variants, such as collapsed PDPs created by the presenter, some of
which may be polemical and debatable.
The audience should be versed in model creation and have at least some
insight into partial dependency plots. The presentation is based on a
simple working example with 6 predictors and one binary target variable
for ease of exposition.
It is not possible to detail exhaustively every method described in this
presentation; an extensive document is in preparation. The presentation requires
3 hours and a wide-awake audience. Double if not awake.
Slides Marked **** can be skipped for easier first reading.
Overall comments and introduction.
Presentation by way of example focusing on Fraud/Default
Data sets and continuing previous chapters.
Aim: study interpretation/diagnosis mostly via Partial
Dependency Plots of logistic regression, Classification
Trees and Regression Boosting.
Presentation available at
https://www.slideshare.net/LeonardoAuslender/visual-
tools-for-interpretation-of-machine-learning-models
At present there are many written opinions and distinctions about the
topic; no time to discuss them all. See Molnar's (2018)
recent book for an overall view.
Overall comments and introduction (cont 1).
No discussion about imbalanced data set modeling
or other modeling issues. No discussion on
literature, all due to time constraints.
This presentation introduces novel visual concepts
as well as tools derived from Partial Dependency
Plots (PDP):
-Overall PDP
-Collapsed PDP
-Marginal PDP
and how they assist in model interpretation.
Objectives of Interpretation.
Why does the model make mistakes (large residuals,
outliers, etc.)?
Which attributes (alone or in groups) end up being important?
Why is this attribute (or set of attributes) not important?
Why is this observation predicted with a high probability
score?
However, the immediate aim is NOT explanations at the observation
level (why predicted sick/churner/innocent…) but …
Objectives of Interpretation (cont. 1)
Why not directly at observation level?
Suppose a model to predict entertainment-type preference for a
database of families in large cities. Since it is not possible to
obtain updated family preferences consistently, and since the data is
'soft', models are necessarily not created at the specific family
level.
Contrariwise, disease diagnostic prediction is closer to
individual explanation and interpretability.
Model Interpretation categorization.
Just as in EDA (but on model results, not on initial data), three
types:
Univariate Model Interpretation (UMI): One variable at a
time. EASIEST to understand and a huge source of "makes
sense". E.g., classical linear model interpretations; e.g.,
reasons to decline a loan.
Bivariate Model Interpretation (BMI): Looking at pairs of
variables to interpret model results.
Multivariate Model Interpretation (MMI): Overall model
interpretation, the most difficult.
Typically, most work results in UMI and perhaps BMI.
Days of Linear Regression Interpretation ***
Based on the "ceteris paribus" assumption, which fails in the case of
even relatively small VIFs. At present, the rule of thumb is VIF >=
10 (R-sq = .90 among predictors) → unstable model.
"Ceteris paribus" exercise: keeping all other predictors
constant, an increase in …. But if the R-sq among predictors is
even 10%, it is not possible to keep all predictors constant while
increasing the variable of interest by 1.
Advantages: EASY to conceptualize because the practice
follows the notion of bivariate correlation.
But the notion is generally wrong in the multivariate case.
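As a hedged illustration of the VIF rule of thumb just mentioned (not the presenter's code), the minimal Python sketch below computes VIFs with statsmodels. It assumes a hypothetical pandas DataFrame holding the six predictors of the working example; the variable names are taken from the data set description later in the deck.

```python
# Minimal sketch: checking the VIF >= 10 rule of thumb with statsmodels.
# Assumes `df` is a pandas DataFrame holding the six predictors of the example.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

PREDICTORS = ["DOCTOR_VISITS", "MEMBER_DURATION", "NO_CLAIMS",
              "NUM_MEMBERS", "OPTOM_PRESC", "TOTAL_SPEND"]

def vif_table(df: pd.DataFrame) -> pd.DataFrame:
    """VIF of each predictor against the remaining ones (intercept added)."""
    X = sm.add_constant(df[PREDICTORS])
    rows = []
    for i, name in enumerate(X.columns):
        if name == "const":
            continue
        vif = variance_inflation_factor(X.values, i)
        rows.append({"variable": name, "VIF": vif, "R2_vs_others": 1 - 1.0 / vif})
    return pd.DataFrame(rows)

# Example usage: vif_table(train_df) -> flag any VIF >= 10 (R-sq >= .90 among predictors).
```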
Corr(X,Y) = β̂ if SD(Y) = SD(X), e.g., if both are
standardized; otherwise they at least share the same sign, and the
interpretation from correlation holds in the simple regression
case.
Notice that the regression of X on Y is NOT the inverse of the
regression of Y on X, because of SD(X) and SD(Y).
Confusion on signs of coefficients and interpretation.

In simple regression,

$$\hat{y}_i = \bar{Y} + \hat{\beta}_{yx}\,(x_i - \bar{X}), \qquad
\hat{\beta}_{yx} = r_{xy}\,\frac{s_Y}{s_X}, \qquad
\hat{\beta}_{xy} = r_{xy}\,\frac{s_X}{s_Y}, \qquad
sg(\hat{\beta}_{yx}) = sg(r_{xy}), \qquad R^2 = r_{xy}^2.$$
In multiple linear regression, the previous relationship does
not hold because predictors can be correlated (r_XZ),
weighted by r_YZ, hinting at co-linearity and/or relationships
of suppression/enhancement →

In the multivariate case, e.g.

$$Y = \alpha + \beta_{YX.Z}\,X + \beta_{YZ.X}\,Z + \varepsilon,$$

the estimated equation (emphasizing "partial") gives, for example,

$$\hat{\beta}_{YX.Z} = \frac{s_Y}{s_X}\cdot\frac{r_{YX} - r_{YZ}\,r_{XZ}}{1 - r_{XZ}^2},
\qquad r_{XZ}^2 < 1,$$

$$sg(\hat{\beta}_{YX.Z}) = sg(r_{YX} - r_{YZ}\,r_{XZ}),$$

so the sign of the partial coefficient depends on the comparison of
$abs(r_{YX})$ with $abs(r_{YZ}\,r_{XZ})$, not on $r_{YX}$ alone.
Comment on Linear Model Interpretation
Even in traditional UMI land, we find that
multivariate relations given by partial and semi-
partial correlations must be part of the
interpretation.
Note that while correlation is a bivariate
relationship, partial and semipartial correlations can be
extended to the multivariate setting (see the sketch below).
However, even BMI and certainly MMI are not so often
performed.
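A hedged numeric illustration of the two-predictor "partial" formula above (not from the presentation): simulated data where the raw correlation of Y with X is positive while the partial coefficient of X given Z is negative. All names here are hypothetical.

```python
# Minimal sketch: sign of a "partial" coefficient vs. the raw correlation,
# checking the two-predictor formula above against an OLS fit on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
x = 0.9 * z + rng.normal(scale=0.5, size=n)      # X strongly correlated with Z
y = -0.2 * x + 1.0 * z + rng.normal(size=n)      # true partial effect of X is negative

r_yx = np.corrcoef(y, x)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
s_y, s_x = y.std(ddof=1), x.std(ddof=1)

# Two-predictor formula for the partial coefficient of X given Z
beta_yx_z = (s_y / s_x) * (r_yx - r_yz * r_xz) / (1 - r_xz ** 2)

# OLS fit of Y on (1, X, Z) for comparison
X = np.column_stack([np.ones(n), x, z])
b = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"r_YX = {r_yx:.3f} (positive), partial beta_YX.Z = {beta_yx_z:.3f}, OLS = {b[1]:.3f}")
```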
Searching for Important variables en route to answering
modeling question.
Case study: minimum components to make a car go
along highway.
1) Engine
2) Tires
3) Steering wheel
4) Transmission
5) Gas
6) ….. Other MMI aspects and interrelations.
Take just one of them out, and the car won't drive. There is no
SINGLE most important variable but a minimum irreducible set of
them. In the Data Science case with n → ∞, there are possibly many subsets of
'important' variables.
But "suspect VARS" are a good starting point for research.
Model M2: data summary

  Item            Information
  TRN DATA set    train
  TRN num obs     3595
  VAL DATA set    validata
  VAL num obs     2365
  TST DATA set    (none)
  TST num obs     0
  Dep. Var        fraud
  TRN % Events    20.389
  VAL % Events    19.281
  TST % Events    .
Original Vars + Labels (Model M2)

  Variable          Label
  DOCTOR_VISITS     Total visits to a doctor
  MEMBER_DURATION   Membership duration
  NO_CLAIMS         No of claims made recently
  NUM_MEMBERS       Number of members covered
  OPTOM_PRESC       Number of opticals claimed
  TOTAL_SPEND       Total spent on opticals
Requested Models: Names & Descriptions.

  Full Model Name               Model Description
  Overall Models
  M2                            20 pct prior
  M2_BY_DEPVAR                  Inference
  01_M2_GB_TRN_TREES            Tree Repr. for Gradient Boosting
  02_M2_TRN_GRAD_BOOSTING       Gradient Boosting
  03_M2_TRN_LOGISTIC_STEPWISE   Logistic TRN STEPWISE
  04_M2_VAL_GRAD_BOOSTING       Gradient Boosting
  05_M2_VAL_LOGISTIC_STEPWISE   Logistic VAL STEPWISE
Data set: Definition by way of Example
• Health insurance company:
  Ophthalmologic Insurance Claims
• Is the claim valid or fraudulent? Binary
  target.
• No transformations created, to keep a
  simple data set.
• Full description and analysis of this data
  set at
  https://www.slideshare.net/LeonardoAuslender
  (lectures at Principal Analytics Prep).
Alphabetic List of Variables and Attributes

  #  Variable          Type  Len  Format   Informat  Label
  3  DOCTOR_VISITS     Num   8    BEST12.  F12.      Total visits to a doctor
  1  FRAUD             Num   8    BEST12.  F12.      Fraudulent Activity yes/no
  5  MEMBER_DURATION   Num   8                       Membership duration
  4  NO_CLAIMS         Num   8    BEST12.  F12.      No of claims made recently
  7  NUM_MEMBERS       Num   8                       Number of members covered
  6  OPTOM_PRESC       Num   8    BEST12.  F12.      Number of opticals claimed
  2  TOTAL_SPEND       Num   8    BEST12.  F12.      Total spent on opticals

Note: No nominal predictors. No transformations, to keep the
presentation simple but not simpler than necessary.
....
Reporting area for all models' coefficients,
importance, etc., and selected variables.
Vars * Models * Coeffs

                    M2_TRN_GRAD_      M2_TRN_LOGISTIC_   M2_VAL_GRAD_      M2_VAL_LOGISTIC_
                    BOOSTING          STEPWISE           BOOSTING          STEPWISE
  Variable          Import.  Nrules   Coeff     PVal     Import.  Nrules   Coeff     PVal
  NUM_MEMBERS       0.1099   2.000    .         .        0.1099   2.000    .         .
  OPTOM_PRESC       0.6211   19.000   0.2178    0.000    0.6211   19.000   0.1463    0.000
  DOCTOR_VISITS     0.4434   20.000   -0.0171   0.020    0.4434   20.000   -0.0065   0.428
  MEMBER_DURATION   0.7843   41.000   -0.0066   0.000    0.7843   41.000   -0.0065   0.000
  TOTAL_SPEND       0.6864   29.000   -0.0000   0.003    0.6864   29.000   -0.0000   0.004
  NO_CLAIMS         1.0000   19.000   0.7752    0.000    1.0000   19.000   0.7610    0.000
  INTERCEPT         .        .        -0.5767   0.000    .        .        -0.5635   0.001

Note: Importance and coefficients share one column, as do p-values and
number of rules (GB columns report Importance / Nrules; logistic columns
report Coeff / PVal).
Logistic Selection Steps (Model M2_TRN_LOGISTIC_STEPWISE)

  Step  Effect Entered    Effect Removed  # in model  P-value
  1     no_claims                         1           .00
  2     member_duration                   2           .00
  3     optom_presc                       3           .00
  4     total_spend                       4           .00
  5     doctor_visits                     5           .02

Dropped: Num_members.
Some conclusions and comments so far:
. Logistic stepwise dropped Num_members, which is shown
with the lowest relative importance in GB. Notice that logistic
regression does not have an agreed-upon scale of importance;
we can use odds ratios, e.g. (sketched below).
. NO_CLAIMS is deemed the most important single variable for
GB, but logistic deems OPTOM_PRESC the second one
(via odds ratios), while GB selected MEMBER_DURATION.
. The remaining variables have odds ratios near 1, which seems to
indicate similarly small effects, while GB distinguishes relative
importance after the first two variables.
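A hedged sketch (not the presenter's code) of the odds-ratio reading mentioned above, using the TRN logistic coefficients quoted in the table as given values:

```python
# Minimal sketch: odds ratios as a rough "importance" reading for the logistic
# model, using the TRN coefficients quoted in the table above.
import numpy as np

logistic_coeffs = {            # per-unit log-odds effects (training fit)
    "NO_CLAIMS":        0.7752,
    "OPTOM_PRESC":      0.2178,
    "DOCTOR_VISITS":   -0.0171,
    "MEMBER_DURATION": -0.0066,
    "TOTAL_SPEND":     -0.0000,
}

for var, beta in sorted(logistic_coeffs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{var:16s} coeff={beta:+.4f}  odds ratio per unit={np.exp(beta):.3f}")

# Caveat: odds ratios are per-unit and scale-dependent; variables measured on very
# different ranges (e.g., TOTAL_SPEND) are not directly comparable this way.
```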
Definite monotonic relationship: higher values of NO_CLAIMS (e.g., bin 5) are
associated with higher probability. Note the differences between LG and GB.
More diffuse relationship.
GB: monotonic decrease for OPTOM_PRESC, diffuse for TOTAL_SPEND.
Marginal Effect:
Change in Prob. as X changes.
LG does not have a measure of importance as GB does →
we use marginal effects plots that indicate the change in
probability along the variable range. Except for
MEMBER_DURATION (which declines initially), the other variables
have positive effects of different intensity, and the maximum value
declines as per the logistic shape. Member duration has a
pronounced decline for low duration levels → possibility
of fraudulent members who join, commit their fraud and
leave.
Note sharper increase in prob. for NO_CLAIMS at bins 1
and 8. Optom_presc at 6.
GB importance measures the impact of individual inputs on
predicting Y, but does not tell how that impact changes along the
range of the inputs, and individual variable effects are not taken into
account
→ use Partial Dependency Plots, also for LG as a free ride.
Marginal Effects and PDPs
Marginal effects refer to the change in probability for a one-
unit change in X, ceteris paribus (if meaningful, or at
least desirable).
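A hedged, model-agnostic sketch of a finite-difference marginal effect curve (not the presenter's code). It assumes a hypothetical fitted classifier `model` with a scikit-learn-style predict_proba and a pandas DataFrame `X` of the predictors.

```python
# Minimal sketch: a finite-difference "marginal effect" curve for any fitted
# classifier with predict_proba (logistic or gradient boosting alike).
import numpy as np
import pandas as pd

def marginal_effect_at_means(model, X: pd.DataFrame, var: str,
                             grid_size: int = 20, delta: float = 1.0) -> pd.DataFrame:
    """Change in P(event) for a +delta change in `var`, other predictors at their means."""
    base = X.mean().to_frame().T                     # one row: all predictors at their means
    grid = np.linspace(X[var].min(), X[var].max(), grid_size)
    rows = []
    for v in grid:
        lo, hi = base.copy(), base.copy()
        lo[var], hi[var] = v, v + delta
        dp = model.predict_proba(hi)[0, 1] - model.predict_proba(lo)[0, 1]
        rows.append({var: v, "dP_per_unit": dp / delta})
    return pd.DataFrame(rows)

# Example usage (hypothetical names): marginal_effect_at_means(logit_model, X_train, "NO_CLAIMS")
```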
PDPs do not indicate change in Y at all; instead, a PDP
measures probability levels at different values of X1,
with all other predictors measured at their means (or modes,
medians, etc.).
→ No marginality in PDPs, unless we also measure the 'change'
in probability. Shown later on, called Marginal
PDPs.
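A hedged sketch of the PDP as defined above (probability level along one predictor, others held at their means), plus the more common data-averaged variant for comparison. All object names are hypothetical.

```python
# Minimal sketch of a PDP as defined above: probability level along the range of
# one predictor with all other predictors held at their means. The second function
# is the more common definition that averages over the observed rows instead.
import numpy as np
import pandas as pd

def pdp_at_means(model, X: pd.DataFrame, var: str, grid_size: int = 20) -> pd.DataFrame:
    base = X.mean().to_frame().T
    grid = np.linspace(X[var].min(), X[var].max(), grid_size)
    probs = []
    for v in grid:
        row = base.copy()
        row[var] = v
        probs.append(model.predict_proba(row)[0, 1])
    return pd.DataFrame({var: grid, "pdp_at_means": probs})

def pdp_averaged(model, X: pd.DataFrame, var: str, grid_size: int = 20) -> pd.DataFrame:
    grid = np.linspace(X[var].min(), X[var].max(), grid_size)
    probs = []
    for v in grid:
        Xmod = X.copy()
        Xmod[var] = v                                # force the predictor to a grid value
        probs.append(model.predict_proba(Xmod)[:, 1].mean())
    return pd.DataFrame({var: grid, "pdp_averaged": probs})
```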
Tree representation(s) up to 4 levels, Model M2_GB_TRN_TREES
(node condition, with node probability in parentheses; rightmost column = leaf prediction)

no_claims < 2.5 (0.185)
  no_claims < 0.5 (0.159)
    member_duration < 180.5 (0.199)
      total_spend < 5250 (0.464)          Pred 0.464
      total_spend >= 5250 (0.186)         Pred 0.186
    member_duration >= 180.5 (0.103)
      doctor_visits >= 5.5 (0.093)        Pred 0.093
      doctor_visits < 5.5 (0.126)         Pred 0.126
  no_claims >= 0.5 (0.321)
    optom_presc < 3.5 (0.291)
      total_spend >= 6300 (0.273)         Pred 0.273
      total_spend < 6300 (0.467)          Pred 0.467
    optom_presc >= 3.5 (0.59)
      member_duration < 154.5 (0.67)      Pred 0.670
      member_duration >= 154.5 (0.447)    Pred 0.447
no_claims >= 2.5 (0.633)
  no_claims < 4.5 (0.57)
    optom_presc < 3.5 (0.54)
      member_duration >= 128.5 (0.498)    Pred 0.498
      member_duration < 128.5 (0.627)     Pred 0.627
    optom_presc >= 3.5 (0.81)
      member_duration >= 137 (0.785)      Pred 0.785
      member_duration < 137 (0.85)        Pred 0.850
  no_claims >= 4.5 (0.761)
    member_duration < 303.5 (0.778)
      member_duration >= 148 (0.757)      Pred 0.757
      member_duration < 148 (0.823)       Pred 0.823

Missing one line.
Tree representation(s) up to 4 levels, Model M2_LG_TRN_TREES
(node condition, with node probability in parentheses; rightmost column = leaf prediction)

no_claims < 1.5 (0.164)
  member_duration < 155.5 (0.235)
    optom_presc < 3.5 (0.213)
      no_claims < 0.5 (0.195)             Pred 0.195
      no_claims >= 0.5 (0.337)            Pred 0.337
    optom_presc >= 3.5 (0.49)
      optom_presc < 6.5 (0.404)           Pred 0.404
      optom_presc >= 6.5 (0.647)          Pred 0.647
  member_duration >= 155.5 (0.111)
    optom_presc < 3.5 (0.103)
      member_duration >= 246.5 (0.065)    Pred 0.065
      member_duration < 246.5 (0.122)     Pred 0.122
    optom_presc >= 3.5 (0.235)
      no_claims >= 0.5 (0.353)            Pred 0.353
      no_claims < 0.5 (0.213)             Pred 0.213
no_claims >= 1.5 (0.61)
  no_claims < 2.5 (0.451)
    member_duration < 155.5 (0.562)
      optom_presc >= 1.5 (0.651)          Pred 0.651
      optom_presc < 1.5 (0.493)           Pred 0.493
    member_duration >= 155.5 (0.353)
      member_duration >= 237 (0.204)      Pred 0.204
      member_duration < 237 (0.39)        Pred 0.390
  no_claims >= 2.5 (0.748)
    no_claims < 4.5 (0.675)
      member_duration >= 236.5 (0.477)    Pred 0.477
      member_duration < 236.5 (0.721)     Pred 0.721
    no_claims >= 4.5 (0.899)
      member_duration >= 272 (0.741)      Pred 0.741

Missing one line.
Comment on Tree Representations
LG starts by splitting on NO_CLAIMS at 2, while GB splits at
3. Predictions for the first level across the 2 models are similar:
(.185; .633) for GB vs. (.164; .61) for LG, which indicates that the
structures identified so far are similar (UMI interpretation).
While in the 2nd level GB only splits on NO_CLAIMS, LG splits
on MEMBER_DURATION for the suspected non-fraudsters of the
first stage and on NO_CLAIMS for the fraudster suspects
(BMI already different).
Predictions are similar only for the 4th node in level 2 (.748
and .761) but different otherwise. The careful reader may
verify that these two predictions emerge by splitting on
NO_CLAIMS, albeit at different values, which supports the
notion of No_claims being the leading clue in our research.
Not so? Maybe?
Comment on Tree Representations (cont. 1)
NO_CLAIMS is not so heavily used after the 2nd level, however,
and the structures of the models are clearly different. GB
does not use it at all, while LG splits at 4.5 to produce the
highest prediction level of .899. While GB did split initially on
No_claims at 2.5 and then at 4.5, it did not reach the same
level of prediction as LG, which started splitting at 1.5.
Going to the marginal effects plot, we can see that
No_claims has the largest slope for low values, but
member_duration has the highest slope at the highest value of the
variable. No similar plots can be created for GB.
Thus, the found structures and consequent interpretations differ,
and there is no isomorphism from one into the other.
Perhaps a fractal approximation?
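A hedged sketch of one way such a "tree representation" of a gradient boosting model could be built (an assumption, not the presenter's actual procedure): fit a shallow surrogate regression tree to the GB predicted probabilities, in the spirit of the surrogate-model idea referenced later with LIME. `X_train` and `y_train` are hypothetical objects.

```python
# Minimal sketch (an assumption about how a "tree representation" could be built):
# fit a shallow surrogate regression tree to the gradient-boosting probabilities.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

# Assumes X_train (DataFrame of the six predictors) and y_train (FRAUD 0/1) exist.
gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
gb_probs = gb.predict_proba(X_train)[:, 1]

surrogate = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, gb_probs)
print(export_text(surrogate, feature_names=list(X_train.columns)))
print("Surrogate R^2 vs GB probabilities:", surrogate.score(X_train, gb_probs))
```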
2nd most “important variable”, very different structures.
1st most “important” variable, very different structures.
Both probs increasing as No_claims increases except for GB between 14
and 17. GB > LG up to NO_CLAIMS = 3, later reverse. Notice increasing
difference in prob. after No_CLAIMS = 5.
Different binned residual results by model and by variable. Notice
that LG fits better except between bins 11 and 17.
Overall declining relation except for the area around bin 15. Notice no consistently
higher levels of prob. for either model, contrary to the two previous slides.
Both models fit relatively poorly for large values of Member_Duration.
Similar gulf in probabilities and severe divergence between 11 and 13. GB
makes more extreme jumps.
Similar behavior as that of Member_Duration.
Strict interpretations: different.
Obvious similarities: e.g., high values of NO_CLAIMS
related to high Probability Values mostly.
INTERESTING: curves not fully monotonically increasing
or decreasing.
Residual plots show that there are patterns not well
fitted.
....
Ranking the models by GOF.
Strongly summarized area for brevity's sake, added just for completeness.
GOF ranks (TRN), by GOF measure
(rank order: AUROC, Avg Square Error, Class Rate, Cum Lift 3rd bin,
Cum Resp Rate 3rd bin, Gini, Precision Rate, Rsquare Cramer/Tjur;
followed by the unweighted mean and median of the ranks)

  Model Name                    Ranks              Unw. Mean  Unw. Median
  02_M2_TRN_GRAD_BOOSTING       1 1 2 1 1 1 1 1    1.13       1
  03_M2_TRN_LOGISTIC_STEPWISE   2 2 1 2 2 2 2 2    1.88       2

GOF ranks (VAL), same GOF measures

  Model Name                    Ranks              Unw. Mean  Unw. Median
  04_M2_VAL_GRAD_BOOSTING       1 1 2 1 1 1 1 1    1.13       1
  05_M2_VAL_LOGISTIC_STEPWISE   2 2 1 2 2 2 2 2    1.88       2
....
Profile and Model Interpretation Area.
....
Univariate Profile diagnostics for 6 Important Vars.
....
Event Proportions and Posterior Probabilities for 5 Important Vars., by original Model Names.
Variables and probabilities are binned for ease of
visualization. The proportion of events is the same across models (it's
just the original data), but probabilities differ across models.
Not all cases shown.
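A hedged sketch of the kind of binned comparison shown in these slides (event proportion vs. mean predicted probability per bin of a predictor); `df` and `model` are hypothetical, and the model is assumed to have been fitted on the non-target columns of `df`.

```python
# Minimal sketch: proportion of events vs. mean predicted probability per bin
# of a predictor, mirroring the binned profile charts discussed here.
import pandas as pd

def binned_profile(df: pd.DataFrame, model, var: str, target: str = "FRAUD",
                   n_bins: int = 10) -> pd.DataFrame:
    out = df[[var, target]].copy()
    # Assumption: the model was trained on all non-target columns of df.
    out["prob"] = model.predict_proba(df.drop(columns=[target]))[:, 1]
    out["bin"] = pd.qcut(out[var], q=n_bins, duplicates="drop")
    return (out.groupby("bin", observed=True)
               .agg(event_rate=(target, "mean"),
                    mean_prob=("prob", "mean"),
                    n=(target, "size"))
               .reset_index())
```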
Etc for the other variables.
Some observations
Binned No_claims: While similar in shape, GradBoost seriously
underestimates the proportion of events throughout, while logistic has that
problem for bins 2, 3, 5, 6, 7. Logistic has a positive slope, while GB
flattens due to the interactive GB model. Up to bin 7, similar behavior for
GB and LG, and then LG jumps to a higher level of probability.
Binned Member_Duration: Probability distributions are similar but not
identical. For bins 1, 2, 3 and 16 both methods underestimate the proportion of
events. Slightly declining slope for both models.
Binned OPTOM_PRESC: Both methods fail to match the proportion of
events in the mid range of the bins. Sudden upshift in the
positive slope for GB starting at bin 15, while the slope is overall flat but
positive for logistic.
....
Rescaled Variables along binned Posterior Probability.
Interpretation: In bin 5, No_claims reaches its overall max (100), while for bin 1 the max is around
35 and 15 on the 0-100 scale for the respective models. Same interpretation for Q3, etc.
And conversely ….. (GB = tree repr. of Grad_boosting …)
....
Partial Dependency Plots and variants for Non-Ensemble Models.
Some variables may be dropped due to computer resources.
Note the narrow range of the GB PDPs compared to those of LG, due to GB's
interactive nature → more difficult to interpret.
Marginal (1) PDP comparative notes
(Marginal (1): one var at a time. One could also marginalize two vars at a time; not
done in this presentation.)
GB marginals are rather flat, except for
MEMBER_DURATION, about which a caveat later on.
LG is juicier: the NO_CLAIMS increase in probability declines
along the range, but OPTOM_PRESC increases, which seems to
indicate that the leading reason would be prescriptions and not
overall claims.
The corresponding marginals for logistic end up with a slowing
down of growth due to the logistic shape. GB is not constrained
in that way.
PDP comparative notes
The Overall PDP is the model probability when all predictors are at their means.
For LG it is about .17, while for GB it is .53. Individual PDPs are (by definition)
deviations from the Overall when the variable of interest is measured along its range
while the others remain at their mean values. GB clumps most PDPs around the
Overall; LG shows clearly distinct values instead.
The highest probability level for GB is around .7 while LG reaches 1, and the
minima are around .6 and 0 respectively. Note the LG monotonicity, while
GB is mostly monotonic (except for Doctor_visits), possibly a product of
the data set being created artificially.
In both cases NO_CLAIMS appears as the leading variable, especially in LG,
but while Member_duration is rather flat in GB, it certainly declines
steadily in LG, with a very different interpretation: longer member
duration implies a steadier customer and familiarity. No_members had
been excluded in LG's stepwise and should not be confused with
MEMBER_DURATION.
....
Now, mix all previous probability plots and PDPs together.
Member_duration alone brags too much.
UMI: Univariate Model Interpretation.
From the preceding pages, we can conclude that:
No_claims: positively associated with increased fraud
for both logistic and grad boosting, but with a far steeper slope in
logistic. Grad_b stays in a narrow band of probability and is
more interactive with other predictors → Grad_b requires
more BMI and MMI. GB's PDP overshoots the posterior
probability → other vars bring down this effect in GB.
Member_duration has a U-shaped relationship,
especially in the logistic case, while GB has a more spiky
one. Note the high spike at minimal duration and the
immediate decline, which seems to indicate members who
committed fraud as soon as they joined and left
immediately.
UMI: Univariate Model Interpretation (cont. 1)
PDP view: logistic shows positive effects of NO_CLAIMS
and OPTOM_PRESC, balanced by negative effects of the
remaining variables.
Comparing the posterior probability with the No_claims PDP,
they are almost the same for logistic. Similarly for
MEMBER_DURATION.
Grad_b instead shows more tepid effects of the same
variables, and almost unchanging effects of the remaining
predictors. Comparing PDP with probability, the other
predictors bring down the PDP of No_claims. Similar effects
for MEMBER_DURATION.
....
PDPs for "Pairs of Variables"
Note: 3d plots tend to interpolate areas with no data, producing false
expectations of results. Thus, flat (binned) charts are sometimes preferable to
3d plots.
Not all pairs of variables are available due to computer resources.
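A hedged sketch of a two-variable PDP grid (others at their means), which can be rendered as a binned heat map instead of a 3d surface; all object names are hypothetical.

```python
# Minimal sketch: a two-variable PDP grid with the remaining predictors at their means.
import numpy as np
import pandas as pd

def pdp_2d_at_means(model, X: pd.DataFrame, var1: str, var2: str,
                    grid_size: int = 15) -> pd.DataFrame:
    base = X.mean().to_frame().T
    g1 = np.linspace(X[var1].min(), X[var1].max(), grid_size)
    g2 = np.linspace(X[var2].min(), X[var2].max(), grid_size)
    rows = []
    for v1 in g1:
        for v2 in g2:
            row = base.copy()
            row[var1], row[var2] = v1, v2
            rows.append({var1: v1, var2: v2,
                         "prob": model.predict_proba(row)[0, 1]})
    return pd.DataFrame(rows)

# Example usage (hypothetical names): pivot the result for a heat map.
# pdp = pdp_2d_at_means(gb, X_train, "NO_CLAIMS", "OPTOM_PRESC")
# heat = pdp.pivot(index="NO_CLAIMS", columns="OPTOM_PRESC", values="prob")
```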
Same for LG.
Note that the correlations of No_claims with the other variables are relatively small when
compared to the pairs Member_duration – Doctor_visits and Member_duration –
Optom_presc. How will this translate into PDPs for 2 variables at a time?
M_: 'M2_TRN_LOGISTIC_STEPWISE' 'ORIGINAL' PDP, Corr '-0.02542'
LG: High levels of NO_CLAIMS have high probability at the lowest level of
total_spend, which probably denotes one-time fraud. Otherwise, even mid levels
of NO_CLAIMS are associated with high probability for any level of TOTAL_SPEND.
It seems that FRAUD is not necessarily linked to TOTAL_SPEND alone.

BINNED ORIGINAL PDP, M2_TRN_GRAD_BOOSTING, Corr '0.05073', NO_CLAIMS x DOCTOR_VISITS
The combination of No_claims & Doctor_visits shows high probability at the NE corner,
and the middle section has a stable high probability level. Too many charts to show, but
necessary for full interpretation.

BINNED ORIGINAL PDP, M2_TRN_GRAD_BOOSTING, Corr '0.02549', NO_CLAIMS x MEMBER_DURATION
Regarding NO_CLAIMS and MEMBER_DURATION, fraud happens at low
levels of duration, after which fraudsters leave.

M_: 'M2_TRN_GRAD_BOOSTING' 'ORIGINAL' PDP, Corr '0.06580'
Probability growth for low NO_CLAIMS as OPTOM_PRESC increases (not as dramatic
as in LG), and then slow and steady probability growth with increases in both vars.

BINNED ORIGINAL PDP, M2_TRN_LOGISTIC_STEPWISE, Corr '0.06580', NO_CLAIMS x OPTOM_PRESC
Steep probability growth for low NO_CLAIMS as OPTOM_PRESC increases; similar
but less pronounced growth for small OPTOM_PRESC and increasing
NO_CLAIMS.

BINNED ORIGINAL PDP, M2_TRN_GRAD_BOOSTING, Corr '-0.10759', MEMBER_DURATION x OPTOM_PRESC
For the pair Optom_presc and Member_duration, for which we have contrasting
pair PDPs with corr = -0.10, the interpretation is very different. While GB shows flat
probabilities throughout, except in the empty NE corner, logistic shows a more
extreme NE corner, plus declining probabilities from the NW top.

BINNED ORIGINAL PDP, M2_TRN_LOGISTIC_STEPWISE, Corr '-0.10759', MEMBER_DURATION x OPTOM_PRESC

M_: 'M2_TRN_GRAD_BOOSTING' 'ORIGINAL' PDP, Corr '-0.02542'
Rather flat relationship in GB, but steeper in LG (next slide).

M_: 'M2_TRN_LOGISTIC_STEPWISE' 'ORIGINAL' PDP, Corr '-0.02542'
Some BMI comments
The previous charts suggest that one type of fraud happens once, at low
levels of duration. Different types of fraud are linked with
prescriptions and claims.
For low levels of claims, increasing prescriptions leads to fraud, as
does a combination of increasing claims and prescriptions
combined.
So it is possible that there are at least 3 types of fraud being
committed.
Other pair combinations were not interesting, showing rather flat surfaces.
For 2-dimensional visualization, collapse the 3d chart by
averaging the levels of variable 2 into those of variable 1 and
compare to the original PDP (see the sketch below).
Original and collapsed PDPs (CPDPs) are derived from the
posterior model probabilities.
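A hedged sketch of the collapsed PDP described above, built from the two-variable PDP of the previous sketch (hypothetical names throughout):

```python
# Minimal sketch of a collapsed PDP (CPDP): average the levels of variable 2
# of a two-variable PDP into those of variable 1, then compare with the
# original one-variable PDP.
import pandas as pd

def collapsed_pdp(pdp_2d: pd.DataFrame, var1: str) -> pd.DataFrame:
    """Average the 2-variable PDP over variable 2, leaving one curve along var1."""
    return (pdp_2d.groupby(var1, as_index=False)["prob"].mean()
                  .rename(columns={"prob": "collapsed_pdp"}))

# Example usage (hypothetical names): overlay
#   collapsed_pdp(pdp, "NO_CLAIMS")   with   pdp_at_means(model, X_train, "NO_CLAIMS")
```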
No room for TOTAL_SPEND
Comments for BMI.
In the case of NO_CLAIMS, all cases show overlap of the
collapsed and Original PDPs, except for OPTOM_PRESC. In the GB
case, DURATION brings down the probability slightly because
duration is itself a strong predictor.
LG shows that the presence of OPTOM_PRESC raises the
posterior probability, not so accentuated in GB. The LG model
could benefit from a NO_CLAIMS and OPTOM_PRESC
interaction, or possibly an overall transformation by way of
obtaining information per month and per number of
members. (LG chart with TOTAL_SPEND omitted for
brevity.)
MEMBER_DURATION shows overlap with all second
variables, plus a declining slope, more evident in the LG
models.
Comments for BMI (cont.)
It is possible to obtain 3-way and higher PDPs, and also to
collapse them; not tried here.
Given the overlap between Original and CPDP, the UMI effects are
correct so far, except possibly for the triplet NO_CLAIMS,
MEMBER_DURATION and OPTOM_PRESC.
Collapsed Triad and Tetrad PDPs.
In like manner as 3d PDPs, it is possible to obtain PDPs for 3 or more
variables, but not possible to graph them (at least, not yet). Collapsed
triads bypass the problem by obtaining the PDP of a specific three-
variable set (triad); tetrads do the same for 4-variable sets.
While it is possible to collapse two of them and present the average PDP
compared to the univariate PDPs, the bivariate PDP and the original probability, it is
also possible to collapse the mean PDP along the 4 quartile ranges of the
third variable (in the TRIAD case) or the fourth variable (in the TETRAD case)
(we bypass the quartile presentation for brevity).
In addition to mean PDPs, we also include max and min PDPs as an
overlay to provide a view of the probability variation (sketched below).
Still working on improving these presentations.
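A hedged sketch of a collapsed TRIAD PDP with mean/min/max overlays, reusing the at-the-means grid idea for three variables at once (all names hypothetical, not the presenter's code):

```python
# Minimal sketch of a collapsed TRIAD PDP: build a 3-variable PDP grid with the
# remaining predictors at their means, then collapse variables 2 and 3 into
# variable 1, keeping the mean plus min/max overlays.
import itertools
import numpy as np
import pandas as pd

def collapsed_triad_pdp(model, X: pd.DataFrame, var1: str, var2: str, var3: str,
                        grid_size: int = 8) -> pd.DataFrame:
    base = X.mean().to_frame().T
    grids = {v: np.linspace(X[v].min(), X[v].max(), grid_size)
             for v in (var1, var2, var3)}
    rows = []
    for v1, v2, v3 in itertools.product(grids[var1], grids[var2], grids[var3]):
        row = base.copy()
        row[var1], row[var2], row[var3] = v1, v2, v3
        rows.append({var1: v1, "prob": model.predict_proba(row)[0, 1]})
    # Collapse variables 2 and 3 into variable 1: mean PDP plus min/max overlays.
    return (pd.DataFrame(rows)
              .groupby(var1)["prob"]
              .agg(["mean", "min", "max"])
              .reset_index())
```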
The original probability is not truly well represented by any combination of PDPs in the
GB case. GB searches for a solution over residuals, which is not fully represented here.
Minor differences when compared to previous TRIAD.
Original probability represented very well with 3 variables
Minor differences between triads and tetrad representations.
....
MMI PDPs in difference form.
Same as previous slide but in difference form.
The residual, iterative nature of GB shows that just 3 variables do not capture the full
interpretation, since the posterior probability remains below and above most of the
data points.
Same as previous slide, in difference form.
Comment on Triads and Tetrads
For the present data set and choice of represented variables,
there was not much difference between TRIADS and
TETRADS, which means that omission of the fourth
variable does not alter our results by much.
Still, we emphasize that it is quite illusory to merely
concentrate on average values. The graphs also contain
max and min values of probability that show high
variability except for large values of NO_CLAIMS. This is
clear since the event rate was 20% and concentrated on
high values of that variable.
The lowest probability level is represented by the lowest levels of the predictors, except for
member duration.
The highest probability levels are determined by mid-levels of No_claims and higher total
spend than in bin 1.
Conclusions on the multivariate distribution of predictors
along probability bins.
Since the event rate is 20%, a good model should clump those observations
into the higher probability bins.
In the lowest bin, the predictor bins are typically the lowest ones, except for
member_duration. Again, this seems to indicate the presence of fraudsters
with short duration who leave immediately, as evinced in bin 5 of the
probability slide.
In this one, most predictors have observations from higher bins.
Other probability bins not listed for brevity's sake.
Similar coefficients among the logistic, GB and beta regressions. With beta regression, a
standard log-odds-style interpretation is possible, with caveats.
Vars * Models * Coeffs (Coeff / Importance)
Columns (left to right): (1) M2_TRN_GRAD_BOOSTING, (2) M2_TRN_GRAD_BOOSTING_BETA_REG,
(3) M2_TRN_LOGISTIC_STEPWISE, (4) M2_TRN_LOGISTIC_STEPWISE_BETA_REG,
(5) M2_VAL_GRAD_BOOSTING, (6) M2_VAL_LOGISTIC_STEPWISE.

  Variable          (1)      (2)       (3)       (4)              (5)      (6)
  MEMBER_DURATION   0.7318   -0.0051   -0.0057   -0.0057          0.7318   -0.0084
  DOCTOR_VISITS     0.3925   -0.0061   .         .                0.3925   .
  TOTAL_SPEND       0.6610   -0.0000   -0.0000   -0.0000          0.6610   -0.0000
  OPTOM_PRESC       0.5944   0.1713    0.2132    0.2132           0.5944   0.1634
  NO_CLAIMS         1.0000   0.7027    0.7921    0.7921           1.0000   0.7351
  SCALE             .        21.8895   .         2.590291516E16   .        .
  INTERCEPT         .        -0.7979   -0.8352   -0.8352          .        -0.3111

(GB columns report importance; logistic and beta-regression columns report coefficients.)
Beta Regression results
Results of analyzing the posterior probabilities (i.e., the
original GB and LG posteriors) via BETA
regression show very similar coefficients and
structures → Beta is reassuring but does not provide
additional information.
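A hedged, minimal sketch of a beta regression with a logit mean link, fit by maximum likelihood with scipy, to re-model posterior probabilities. This is not the presenter's code; the design-matrix and probability inputs are hypothetical.

```python
# Minimal sketch: beta regression with a logit mean link, fit by maximum likelihood,
# applied to the posterior probabilities of another model.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

def beta_regression(X: np.ndarray, p: np.ndarray):
    """X: (n, k) design matrix including an intercept column; p: probabilities in (0, 1)."""
    eps = 1e-6
    p = np.clip(p, eps, 1 - eps)
    n, k = X.shape

    def negloglik(params):
        beta, log_phi = params[:k], params[k]
        mu = expit(X @ beta)                  # mean via logit link
        phi = np.exp(log_phi)                 # precision parameter (the SCALE above)
        a, b = mu * phi, (1 - mu) * phi
        ll = (gammaln(phi) - gammaln(a) - gammaln(b)
              + (a - 1) * np.log(p) + (b - 1) * np.log(1 - p))
        return -ll.sum()

    res = minimize(negloglik, np.zeros(k + 1), method="BFGS")
    return res.x[:k], np.exp(res.x[k])        # coefficients, precision (scale)

# Example usage (hypothetical): pass a design matrix of the predictors (with an
# intercept column) and the GB or LG posterior probabilities as `p`.
```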
Possible to say that …
1) Manipulate just NO_CLAIMS and the problem is
solved?
2) Maybe add MEMBER_DURATION and
OPTOM_PRESC for parts of the NO_CLAIMS
range?
3) Maybe add if-then rules from the simplified
TREE_REPRESENTATION, because it is easier than
GB and more interactive than LG?
4) If using a Neural Network and NN derivatives,
abandon all hope of interpretation?
5) → Interpretation needs a definition of the INTENDED
AUDIENCE (see Tolstoy ut supra).
Possible to say that … (cont. 1)
1) The analyst needs to focus on NO_CLAIMS,
MEMBER_DURATION and OPTOM_PRESC as
an ‘IMPORTANT’ group.
2) Different model Interpretations should be
entertained.
3) Different marginal effects must be explained.
Final thoughts …
MI analysis must proceed further for large
numbers of predictors, obtaining insights from
collapsing 3+ way PDPs, for instance.
If an 'easier' linear-model explanation is preferred, beta
regression on the posterior probability would provide
regression-like information. Still, beta regression
is not straightforward, and model selection is a big
issue.
Future steps
Focus on MMI
1) Beta regression for 'linear' interpretation. More
difficult because it requires a model search as well,
plus additional error in modeling the posterior
probability of the original model.
2) Andrews' curves.
LIME: Local Interpretable Model-Agnostic
Explanations:
Uses a surrogate interpretable model on the black-box model, applied to observations of
interest. The tree representation in this presentation is similar to this.
(https://homes.cs.washington.edu/~marcotcr/blog/lime/)
ICE: Clusters or a classification variable applied to
PDP results. For a given predictor, ICE plots draw one line per observation,
representing how the instance's prediction changes when the predictor
changes.
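A hedged sketch of ICE curves as just described (one curve per observation, a single predictor varied over its range while the other columns stay as observed); `model` and `X` are hypothetical.

```python
# Minimal sketch of ICE curves: one line per observation, varying a single
# predictor over its range while the other columns keep their observed values.
import numpy as np
import pandas as pd

def ice_curves(model, X: pd.DataFrame, var: str, grid_size: int = 20,
               sample: int = 50, random_state: int = 0) -> pd.DataFrame:
    rows = X.sample(min(sample, len(X)), random_state=random_state)
    grid = np.linspace(X[var].min(), X[var].max(), grid_size)
    curves = []
    for idx, (_, row) in enumerate(rows.iterrows()):
        for v in grid:
            r = row.copy()
            r[var] = v
            curves.append({"obs": idx, var: v,
                           "prob": model.predict_proba(r.to_frame().T)[0, 1]})
    # Plot one line per `obs`; averaging the curves recovers the (data-averaged) PDP.
    return pd.DataFrame(curves)
```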
Shap Values: Shapley Additive Explanation (Lundberg
et al., 2017): measures +/- feature contributions to the probability. The technique is used in
game theory to determine each player's contribution to success in a
game. Affected by correlations among predictors → focusing on just one
predictor to change behavior may change other predictors as well
(available in Python).
AND OTHERS ….
References
Lundberg, S.M., Lee, S.I. (2017): "Consistent feature
attribution for tree ensembles", presented at the 2017
ICML Workshop on Human Interpretability in Machine
Learning (WHI 2017), Sydney, NSW, Australia.
https://arxiv.org/abs/1706.06060
Molnar, C. (2018): Interpretable Machine Learning: A
Guide for Making Black Box Models Explainable.
https://christophm.github.io/interpretable-ml-book/
Tolstoy, Leo (1894): The Kingdom of God Is Within You.
Visual Tools for explaining Machine Learning Models

More Related Content

What's hot

Impact of Normalization in Future
Impact of Normalization in FutureImpact of Normalization in Future
Impact of Normalization in Futureijtsrd
 
Introduction to statistical modeling in R
Introduction to statistical modeling in RIntroduction to statistical modeling in R
Introduction to statistical modeling in Rrichardchandler
 
Lesson 8 zscore
Lesson 8 zscoreLesson 8 zscore
Lesson 8 zscorenurun2010
 
Chahine Hypothesis Testing,
Chahine Hypothesis Testing,Chahine Hypothesis Testing,
Chahine Hypothesis Testing,Saad Chahine
 
PREDICTIVE EVALUATION OF THE STOCK PORTFOLIO PERFORMANCE USING FUZZY CMEANS A...
PREDICTIVE EVALUATION OF THE STOCK PORTFOLIO PERFORMANCE USING FUZZY CMEANS A...PREDICTIVE EVALUATION OF THE STOCK PORTFOLIO PERFORMANCE USING FUZZY CMEANS A...
PREDICTIVE EVALUATION OF THE STOCK PORTFOLIO PERFORMANCE USING FUZZY CMEANS A...ijfls
 
TYPE-2 FUZZY LINEAR PROGRAMMING PROBLEMS WITH PERFECTLY NORMAL INTERVAL TYPE-...
TYPE-2 FUZZY LINEAR PROGRAMMING PROBLEMS WITH PERFECTLY NORMAL INTERVAL TYPE-...TYPE-2 FUZZY LINEAR PROGRAMMING PROBLEMS WITH PERFECTLY NORMAL INTERVAL TYPE-...
TYPE-2 FUZZY LINEAR PROGRAMMING PROBLEMS WITH PERFECTLY NORMAL INTERVAL TYPE-...ijceronline
 
Physics 0625 - Paper 3 version 3 - Mark scheme - May Jun 2013
Physics 0625 - Paper 3 version 3 - Mark scheme - May Jun 2013Physics 0625 - Paper 3 version 3 - Mark scheme - May Jun 2013
Physics 0625 - Paper 3 version 3 - Mark scheme - May Jun 2013JakKy Kitmanacharounpong
 
Heteroscedasticity
HeteroscedasticityHeteroscedasticity
HeteroscedasticityMuhammad Ali
 
PRML Chapter 9
PRML Chapter 9PRML Chapter 9
PRML Chapter 9Sunwoo Kim
 
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...ijaia
 
Regression topics
Regression topicsRegression topics
Regression topicsGaetan Lion
 

What's hot (19)

Impact of Normalization in Future
Impact of Normalization in FutureImpact of Normalization in Future
Impact of Normalization in Future
 
Introduction to statistical modeling in R
Introduction to statistical modeling in RIntroduction to statistical modeling in R
Introduction to statistical modeling in R
 
Lesson 8 zscore
Lesson 8 zscoreLesson 8 zscore
Lesson 8 zscore
 
Chahine Hypothesis Testing,
Chahine Hypothesis Testing,Chahine Hypothesis Testing,
Chahine Hypothesis Testing,
 
FLIPKART SAMSUNG
FLIPKART SAMSUNGFLIPKART SAMSUNG
FLIPKART SAMSUNG
 
Chapter 8
Chapter 8Chapter 8
Chapter 8
 
PREDICTIVE EVALUATION OF THE STOCK PORTFOLIO PERFORMANCE USING FUZZY CMEANS A...
PREDICTIVE EVALUATION OF THE STOCK PORTFOLIO PERFORMANCE USING FUZZY CMEANS A...PREDICTIVE EVALUATION OF THE STOCK PORTFOLIO PERFORMANCE USING FUZZY CMEANS A...
PREDICTIVE EVALUATION OF THE STOCK PORTFOLIO PERFORMANCE USING FUZZY CMEANS A...
 
Lesson07
Lesson07Lesson07
Lesson07
 
Data analysis
Data analysisData analysis
Data analysis
 
TYPE-2 FUZZY LINEAR PROGRAMMING PROBLEMS WITH PERFECTLY NORMAL INTERVAL TYPE-...
TYPE-2 FUZZY LINEAR PROGRAMMING PROBLEMS WITH PERFECTLY NORMAL INTERVAL TYPE-...TYPE-2 FUZZY LINEAR PROGRAMMING PROBLEMS WITH PERFECTLY NORMAL INTERVAL TYPE-...
TYPE-2 FUZZY LINEAR PROGRAMMING PROBLEMS WITH PERFECTLY NORMAL INTERVAL TYPE-...
 
Physics 0625 - Paper 3 version 3 - Mark scheme - May Jun 2013
Physics 0625 - Paper 3 version 3 - Mark scheme - May Jun 2013Physics 0625 - Paper 3 version 3 - Mark scheme - May Jun 2013
Physics 0625 - Paper 3 version 3 - Mark scheme - May Jun 2013
 
Asment5n6 fraction
Asment5n6 fractionAsment5n6 fraction
Asment5n6 fraction
 
Malhotra20
Malhotra20Malhotra20
Malhotra20
 
Business statistics
Business statisticsBusiness statistics
Business statistics
 
Heteroscedasticity
HeteroscedasticityHeteroscedasticity
Heteroscedasticity
 
PRML Chapter 9
PRML Chapter 9PRML Chapter 9
PRML Chapter 9
 
Econometrics ch12
Econometrics ch12Econometrics ch12
Econometrics ch12
 
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
 
Regression topics
Regression topicsRegression topics
Regression topics
 

Similar to Visual Tools for explaining Machine Learning Models

4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdfLeonardo Auslender
 
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdfLeonardo Auslender
 
22_RepeatedMeasuresDesign_Complete.pptx
22_RepeatedMeasuresDesign_Complete.pptx22_RepeatedMeasuresDesign_Complete.pptx
22_RepeatedMeasuresDesign_Complete.pptxMarceloHenriques20
 
Multinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisMultinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisHARISH Kumar H R
 
Interpretability in ML & Sparse Linear Regression
Interpretability in ML & Sparse Linear RegressionInterpretability in ML & Sparse Linear Regression
Interpretability in ML & Sparse Linear RegressionUnchitta Kan
 
Pentaho Meeting 2008 - Statistics & BI
Pentaho Meeting 2008 - Statistics & BIPentaho Meeting 2008 - Statistics & BI
Pentaho Meeting 2008 - Statistics & BIStudio Synthesis
 
Multiple Regression.ppt
Multiple Regression.pptMultiple Regression.ppt
Multiple Regression.pptTanyaWadhwani4
 
Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom IndustrySatyam Barsaiyan
 
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved.  Pa.docx© Charles T. Diebold, Ph.D., 73013. All Rights Reserved.  Pa.docx
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docxLynellBull52
 
Writing a scientific paper : DISCUSSION AND CONCLUSION.ppt
Writing a scientific paper : DISCUSSION AND CONCLUSION.pptWriting a scientific paper : DISCUSSION AND CONCLUSION.ppt
Writing a scientific paper : DISCUSSION AND CONCLUSION.pptSuadFaraj
 
QM0012- STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITY
QM0012- STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITYQM0012- STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITY
QM0012- STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITYsmumbahelp
 
1. What two conditions must be met before an entity can be classif.docx
1. What two conditions must be met before an entity can be classif.docx1. What two conditions must be met before an entity can be classif.docx
1. What two conditions must be met before an entity can be classif.docxjackiewalcutt
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
15 ch ken black solution
15 ch ken black solution15 ch ken black solution
15 ch ken black solutionKrunal Shah
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakashShivaram Prakash
 

Similar to Visual Tools for explaining Machine Learning Models (20)

4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf
 
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf
 
22_RepeatedMeasuresDesign_Complete.pptx
22_RepeatedMeasuresDesign_Complete.pptx22_RepeatedMeasuresDesign_Complete.pptx
22_RepeatedMeasuresDesign_Complete.pptx
 
Multinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisMultinomial Logistic Regression Analysis
Multinomial Logistic Regression Analysis
 
Interpretability in ML & Sparse Linear Regression
Interpretability in ML & Sparse Linear RegressionInterpretability in ML & Sparse Linear Regression
Interpretability in ML & Sparse Linear Regression
 
Pentaho Meeting 2008 - Statistics & BI
Pentaho Meeting 2008 - Statistics & BIPentaho Meeting 2008 - Statistics & BI
Pentaho Meeting 2008 - Statistics & BI
 
Multiple Regression.ppt
Multiple Regression.pptMultiple Regression.ppt
Multiple Regression.ppt
 
Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom Industry
 
Lesson 1 07 measures of variation
Lesson 1 07 measures of variationLesson 1 07 measures of variation
Lesson 1 07 measures of variation
 
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved.  Pa.docx© Charles T. Diebold, Ph.D., 73013. All Rights Reserved.  Pa.docx
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
 
Introduction to cart_2009
Introduction to cart_2009Introduction to cart_2009
Introduction to cart_2009
 
report
reportreport
report
 
Writing a scientific paper : DISCUSSION AND CONCLUSION.ppt
Writing a scientific paper : DISCUSSION AND CONCLUSION.pptWriting a scientific paper : DISCUSSION AND CONCLUSION.ppt
Writing a scientific paper : DISCUSSION AND CONCLUSION.ppt
 
QM0012- STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITY
QM0012- STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITYQM0012- STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITY
QM0012- STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITY
 
1. What two conditions must be met before an entity can be classif.docx
1. What two conditions must be met before an entity can be classif.docx1. What two conditions must be met before an entity can be classif.docx
1. What two conditions must be met before an entity can be classif.docx
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
15 ch ken black solution
15 ch ken black solution15 ch ken black solution
15 ch ken black solution
 
FMI output gap
FMI output gapFMI output gap
FMI output gap
 
Wp13105
Wp13105Wp13105
Wp13105
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakash
 

More from Leonardo Auslender

4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdfLeonardo Auslender
 
4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdfLeonardo Auslender
 
4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdfLeonardo Auslender
 
4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdfLeonardo Auslender
 
4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdf4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdfLeonardo Auslender
 
4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdf4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdfLeonardo Auslender
 
Classification methods and assessment.pdf
Classification methods and assessment.pdfClassification methods and assessment.pdf
Classification methods and assessment.pdfLeonardo Auslender
 
4 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-074 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-07Leonardo Auslender
 
4 2 ensemble models and grad boost part 2 2019-10-07
4 2 ensemble models and grad boost part 2 2019-10-074 2 ensemble models and grad boost part 2 2019-10-07
4 2 ensemble models and grad boost part 2 2019-10-07Leonardo Auslender
 
4 2 ensemble models and grad boost part 1 2019-10-07
4 2 ensemble models and grad boost part 1 2019-10-074 2 ensemble models and grad boost part 1 2019-10-07
4 2 ensemble models and grad boost part 1 2019-10-07Leonardo Auslender
 

More from Leonardo Auslender (20)

1 UMI.pdf
1 UMI.pdf1 UMI.pdf
1 UMI.pdf
 
Ensembles.pdf
Ensembles.pdfEnsembles.pdf
Ensembles.pdf
 
Suppression Enhancement.pdf
Suppression Enhancement.pdfSuppression Enhancement.pdf
Suppression Enhancement.pdf
 
4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf
 
4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf
 
4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf
 
4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf
 
4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdf4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdf
 
4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdf4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdf
 
4_1_Tree World.pdf
4_1_Tree World.pdf4_1_Tree World.pdf
4_1_Tree World.pdf
 
Classification methods and assessment.pdf
Classification methods and assessment.pdfClassification methods and assessment.pdf
Classification methods and assessment.pdf
 
Linear Regression.pdf
Linear Regression.pdfLinear Regression.pdf
Linear Regression.pdf
 
4 MEDA.pdf
4 MEDA.pdf4 MEDA.pdf
4 MEDA.pdf
 
2 UEDA.pdf
2 UEDA.pdf2 UEDA.pdf
2 UEDA.pdf
 
3 BEDA.pdf
3 BEDA.pdf3 BEDA.pdf
3 BEDA.pdf
 
1 EDA.pdf
1 EDA.pdf1 EDA.pdf
1 EDA.pdf
 
0 Statistics Intro.pdf
0 Statistics Intro.pdf0 Statistics Intro.pdf
0 Statistics Intro.pdf
 
4 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-074 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-07
 
4 2 ensemble models and grad boost part 2 2019-10-07
4 2 ensemble models and grad boost part 2 2019-10-074 2 ensemble models and grad boost part 2 2019-10-07
4 2 ensemble models and grad boost part 2 2019-10-07
 
4 2 ensemble models and grad boost part 1 2019-10-07
4 2 ensemble models and grad boost part 1 2019-10-074 2 ensemble models and grad boost part 1 2019-10-07
4 2 ensemble models and grad boost part 1 2019-10-07
 

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/managementakshesh doshi
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 

Visual Tools for explaining Machine Learning Models

  • 1.
  • 2. Abstract Statistical and data science models are considered to be, somewhat pejoratively, black-boxes, interpretation of which has not been systematically studied. Molnar’s “Interpretable Machine Learning” is a big effort in finding solutions. Our presentation is humbler. We aim at presenting visual tools for model interpretation based on partial dependency plots and their variants, such as collapsed PDPs created by the presenter, some of which may be polemical and debatable. The audience should be versed in models creation, and at least some insight into partial dependency plots. The presentation will be based on a simple working example with 6 predictors and one binary target variable for ease of exposition. Not possible to detail exhaustively every method described in this presentation. Extensive document in preparation. Presentation requires 3 hours and wide awake audience. Double if not awake. Slides Marked **** can be skipped for easier first reading.
  • 3.
  • 4. Overall comments and introduction. Presentation by way of example focusing on Fraud/Default Data sets and continuing previous chapters. Aim: study interpretation/diagnosis mostly via Partial Dependency Plots of logistic regression, Classification Trees and Regression Boosting. Presentation available at https://www.slideshare.net/LeonardoAuslender/visual- tools-for-interpretation-of-machine-learning-models At present, lots of written opinions and distinctions about topic. No time to discuss them all. See Molnar’s (2018) recent book for an overall view.
  • 5. Overall comments and introduction (cont 1). No discussion about imbalanced data set modeling or other modeling issues. No discussion on literature, all due to time constraints. This presentation introduces novel visual concepts as well as tools derived from Partial Dependency Plots (PDP): -Overall PDP -Collapsed PDP -Marginal PDP and how they assist in model interpretation.
  • 6. Objectives of Interpretation. Why does the model make mistakes (large residuals, outliers, etc.)? Which attributes (alone / group) end up being important? Why is this attribute/s not important? Why is this observation predicted with high probability score? However, immediate aim is NOT explanations at observation level (why predicted sick/churner/innocent…) but
  • 7. Objectives of Interpretation (cont. 1) Why not directly at observation level? Suppose model to predict entertainment type preference for database of families in large cities. Since not possible to obtain updated family preferences consistently, since data is ‘soft’, models necessarily are not created at specific family levels. Contrariwise, disease diagnostic prediction is closer to individual explanation and interpretablity.
  • 8.
  • 9. Model Interpretation categorization. Just as in EDA (but on model results, not on initial data), three types: Univariate Model Interpretation (UMI): One variable at a type. EASIEST to understand and huge source of “makes sense”. E.g., Classical linear models interpretations. E.g., reasons to decline a loan. Bivariate Model Interpretation (BMI): Looking at pairs of variables to interpret model results. Multivariate Model Interpreation (MMI): Overall model interpretation, most difficult. Typically, most work results in UMI and perhaps BMI.
  • 10.
  • 11. Days of Linear Regression Interpretation *** Based on “ceteris paribus” assumption that fails In case of Even relatively small VIFs. At present, rule of thumb VIF >= 10 (R-sq = .90 among predictors)  unstable model. “Ceteris paribus” exercise: Keeping all other predictors Constant, an increase in …. But if R-sq among predictors is Even 10%, not possible to keep all predictors constant while Increasing by 1 the variable of interest. Advantages: EASY to conceptualize because practice Follows notion of bivariate correlation. But notion is generally wrong in multivariate case..
  • 12. In simple regression, Corr(X,Y) equals the slope if SD(Y) = SD(X) (e.g., if both are standardized); otherwise the two at least share the same sign, so the interpretation from correlation holds in the simple regression case:
     $\hat{\beta}_{YX} = r_{XY}\,\frac{s_Y}{s_X}, \qquad \hat{Y}_i = \bar{Y} + r_{XY}\,\frac{s_Y}{s_X}\,(X_i - \bar{X}), \qquad \mathrm{sg}(\hat{\beta}_{YX}) = \mathrm{sg}(r_{XY}), \qquad R^2 = r_{XY}^2.$
     Notice that the regression of X on Y is NOT the inverse of the regression of Y on X, because of $s_X$ and $s_Y$ → confusion about signs of coefficients and their interpretation.
  • 13. In multiple linear regression the previous relationship does not hold, because predictors can be correlated ($r_{XZ}$), weighted by $r_{YZ}$, hinting at collinearity and/or relationships of suppression/enhancement. With two predictors, $Y = \alpha + \beta_{YX.Z}\,X + \beta_{YZ.X}\,Z + \varepsilon$, the estimated (emphasizing "partial") coefficient of X is
     $\hat{\beta}_{YX.Z} = \frac{s_Y}{s_X}\,\frac{r_{YX} - r_{YZ}\,r_{XZ}}{1 - r_{XZ}^2}, \qquad \mathrm{sg}(\hat{\beta}_{YX.Z}) = \mathrm{sg}(r_{YX} - r_{YZ}\,r_{XZ}),$
     which need not equal $\mathrm{sg}(r_{YX})$, and the magnitude is rescaled by $1/(1 - r_{XZ}^2)$ relative to the simple-regression slope.
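As a quick numeric illustration of the sign issue, a minimal numpy sketch on simulated, hypothetical data (not the presentation's data set):

```python
# Sign of the simple-regression slope follows r_YX; the partial slope in a
# two-predictor model follows r_YX - r_YZ * r_XZ and can flip sign.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
x = 0.9 * z + 0.1 * rng.normal(size=n)        # X strongly correlated with Z
y = -1.0 * x + 2.0 * z + rng.normal(size=n)   # Y depends on both predictors

r_yx = np.corrcoef(y, x)[0, 1]                # simple correlation of Y and X
b_simple = np.polyfit(x, y, 1)[0]             # slope of Y on X alone
b_partial = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y, rcond=None)[0][1]

print(f"r_YX = {r_yx:.2f}, simple slope = {b_simple:.2f}, partial slope = {b_partial:.2f}")
# r_YX and the simple slope come out positive while the partial slope of X is
# negative: "ceteris paribus" reasoning on the simple fit is misleading here.
```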
  • 14. Comment on Linear Model Interpretation. Even in traditional UMI land, we find that multivariate relations given by partial and semi-partial correlations must be part of the interpretation. Note that while correlation is a bivariate relationship, partial and semi-partial correlations can be extended to the multivariate setting. However, even BMI, and certainly MMI, are not so often performed.
  • 15.
  • 16. Searching for important variables en route to answering the modeling question. Case study: minimum components to make a car go along a highway. 1) Engine 2) Tires 3) Steering wheel 4) Transmission 5) Gas 6) ….. Other MMI aspects and interrelations. Take just one of them out, and the car won't drive. There is no SINGLE most important variable, but a minimum irreducible set of them. In the Data Science case with n → ∞, there are possibly many subsets of 'important' variables. But "suspect vars" are a good starting point for research.
  • 17.
  • 18.
  • 19. Model information, Model M2:
     TRN data set: train     — 3,595 obs., 20.389% events
     VAL data set: validata  — 2,365 obs., 19.281% events
     TST data set: (none)    — 0 obs.
     Dependent variable: fraud
  • 20. Original Vars + Labels, Model M2:
     DOCTOR_VISITS     Total visits to a doctor
     MEMBER_DURATION   Membership duration
     NO_CLAIMS         No. of claims made recently
     NUM_MEMBERS       Number of members covered
     OPTOM_PRESC       Number of opticals claimed
     TOTAL_SPEND       Total spent on opticals
  • 21. Requested Models: Names & Descriptions.
     M2                            Overall model, 20 pct prior
     M2_BY_DEPVAR                  Inference
     01_M2_GB_TRN_TREES            Tree representation for Gradient Boosting
     02_M2_TRN_GRAD_BOOSTING       Gradient Boosting (TRN)
     03_M2_TRN_LOGISTIC_STEPWISE   Logistic, TRN, stepwise
     04_M2_VAL_GRAD_BOOSTING       Gradient Boosting (VAL)
     05_M2_VAL_LOGISTIC_STEPWISE   Logistic, VAL, stepwise
  • 22. Data set: Definition by way of Example • Health insurance company: Ophthalmologic insurance claims • Is the claim valid or fraudulent? Binary target. • No transformations created, to keep the data set simple. • Full description and analysis of this data set at https://www.slideshare.net/LeonardoAuslender (lectures at Principal Analytics Prep).
  • 23. Alphabetic List of Variables and Attributes (all numeric; no nominal predictors):
     1  FRAUD             Fraudulent activity yes/no (target)
     2  TOTAL_SPEND       Total spent on opticals
     3  DOCTOR_VISITS     Total visits to a doctor
     4  NO_CLAIMS         No. of claims made recently
     5  MEMBER_DURATION   Membership duration
     6  OPTOM_PRESC       Number of opticals claimed
     7  NUM_MEMBERS       Number of members covered
     Note: No transformations, to keep the presentation simple but not simpler than necessary.
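For readers who want to follow along in code, a hedged sketch of the two model types compared below; the file name claims.csv and the scikit-learn settings are assumptions (the slides' models were not necessarily fit this way, and scikit-learn's LogisticRegression is penalized rather than stepwise):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("claims.csv")                      # hypothetical file with the columns above
predictors = ["DOCTOR_VISITS", "MEMBER_DURATION", "NO_CLAIMS",
              "NUM_MEMBERS", "OPTOM_PRESC", "TOTAL_SPEND"]
X, y = df[predictors], df["FRAUD"]
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.4,
                                              stratify=y, random_state=1)

gb = GradientBoostingClassifier(random_state=1).fit(X_trn, y_trn)
lg = LogisticRegression(max_iter=1000).fit(X_trn, y_trn)

# Relative importance (GB) vs. coefficients (logistic), as in the table on slide 25.
print(dict(zip(predictors, gb.feature_importances_.round(3))))
print(dict(zip(predictors, lg.coef_[0].round(4))))
```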
  • 24. .... Reporting area for all models' coefficients, importance, etc., and selected variables.
  • 25. Vars * Models * Coeffs (importance and coefficients share one column, as do p-values and number of rules):

                        M2_TRN_GRAD_BOOSTING    M2_TRN_LOGISTIC_STEPWISE    M2_VAL_GRAD_BOOSTING    M2_VAL_LOGISTIC_STEPWISE
     Variable           Importance / Nrules     Coeff / P-val               Importance / Nrules     Coeff / P-val
     NUM_MEMBERS        0.1099 /  2              .                          0.1099 /  2              .
     OPTOM_PRESC        0.6211 / 19              0.2178 / 0.000             0.6211 / 19              0.1463 / 0.000
     DOCTOR_VISITS      0.4434 / 20             -0.0171 / 0.020             0.4434 / 20             -0.0065 / 0.428
     MEMBER_DURATION    0.7843 / 41             -0.0066 / 0.000             0.7843 / 41             -0.0065 / 0.000
     TOTAL_SPEND        0.6864 / 29             -0.0000 / 0.003             0.6864 / 29             -0.0000 / 0.004
     NO_CLAIMS          1.0000 / 19              0.7752 / 0.000             1.0000 / 19              0.7610 / 0.000
     INTERCEPT           .                      -0.5767 / 0.000              .                      -0.5635 / 0.001
  • 26. Logistic Selection Steps (M2_TRN_LOGISTIC_STEPWISE):
     Step 1: no_claims entered        (1 in model, p = .00)
     Step 2: member_duration entered  (2 in model, p = .00)
     Step 3: optom_presc entered      (3 in model, p = .00)
     Step 4: total_spend entered      (4 in model, p = .00)
     Step 5: doctor_visits entered    (5 in model, p = .02)
     Num_members was dropped (never entered).
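The slides do not show the selection code; a rough forward-selection sketch in the same spirit, using statsmodels Logit p-values (the 0.05 entry threshold and the p-value criterion are assumptions), might look like this:

```python
import statsmodels.api as sm

def forward_stepwise_logit(X, y, alpha=0.05):
    """Enter, one at a time, the candidate with the smallest p-value below alpha."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for var in remaining:
            fit = sm.Logit(y, sm.add_constant(X[selected + [var]])).fit(disp=0)
            pvals[var] = fit.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# forward_stepwise_logit(X_trn, y_trn) would be expected to enter NO_CLAIMS first
# and to leave NUM_MEMBERS out, mirroring the table above.
```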
  • 27.
  • 28.
  • 29. Some conclusions and comments so far:
     - Logistic stepwise dropped NUM_MEMBERS, which is shown with the lowest relative importance in GB. Notice that logistic regression does not have an agreed-upon scale of importance; we can use odds ratios, e.g.
     - NO_CLAIMS is deemed the most important single variable for GB, but logistic deems OPTOM_PRESC the second one (via odds ratios), while GB selected MEMBER_DURATION.
     - The remaining variables have odds ratios of 1, which seems to indicate similar effects, while GB distinguishes relative importance after the first two variables.
  • 30.
  • 31. Definite monotonic relationship, with higher values of NO_CLAIMS (e.g., bin 5) associated with higher probability. Note the differences between LG and GB.
  • 33. GB monotonic decrease for OPTOM_PRESC, diffuse for TOTAL_SPEND.
  • 34.
  • 35. Marginal effect: change in probability as X changes.
  • 36. LG does not have a measure of importance as GB does → we use marginal effects plots, which indicate the change in probability along each variable's range. Except for MEMBER_DURATION (which declines initially), the other effects are positive with different intensities, and the maximum value declines as per the logistic shape. Member duration shows a pronounced decline at low duration levels → possibly fraudulent members who join, commit their fraud and leave. Note the sharper increase in probability for NO_CLAIMS at bins 1 and 8, and for OPTOM_PRESC at 6. GB importance measures the impact of individual inputs on predicting Y, but does not tell how that impact changes along the range of the inputs, and individual variable effects are not in consideration → use Partial Dependency Plots, obtained for LG as well as a free ride.
  • 37. Marginal Effects and PDPs. Marginal effects refer to the change in probability for a one-unit change in X, ceteris paribus (if meaningful, or at least desirable). PDPs do not indicate change in Y at all; instead, a PDP measures probability levels at different values of X1 with all other predictors held at their means (or modes, medians, etc.). Hence there is no marginality in PDPs, unless we also measure the 'change' in probability; shown later on as Marginal PDPs.
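A minimal sketch of the two quantities, following this slide's definition of the PDP (all other predictors held at their means; the usual PDP instead averages over the data). The model and X objects are assumed to come from the earlier fitting sketch:

```python
import numpy as np
import pandas as pd

def pdp_at_means(model, X, var, grid_size=20):
    """Probability LEVELS along `var`, all other predictors held at their means."""
    grid = np.linspace(X[var].min(), X[var].max(), grid_size)
    rows = pd.concat([X.mean().to_frame().T] * grid_size, ignore_index=True)
    rows[var] = grid
    return grid, model.predict_proba(rows[X.columns])[:, 1]

def marginal_effect_at_means(model, X, var, grid_size=20):
    """Approximate CHANGE in probability per unit of `var` (finite differences of the PDP)."""
    grid, p = pdp_at_means(model, X, var, grid_size)
    return grid[:-1], np.diff(p) / np.diff(grid)

# e.g. pdp_at_means(lg, X_trn, "NO_CLAIMS") vs. marginal_effect_at_means(lg, X_trn, "NO_CLAIMS")
```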
  • 38.
  • 39. Tree representation up to 4 levels, Model M2_GB_TRN_TREES (node probabilities in parentheses; one line missing in the source):
     no_claims < 2.5 (0.185)
        no_claims < 0.5 (0.159)
           member_duration < 180.5 (0.199)
              total_spend < 5250 (0.464)
              total_spend >= 5250 (0.186)
           member_duration >= 180.5 (0.103)
              doctor_visits >= 5.5 (0.093)
              doctor_visits < 5.5 (0.126)
        no_claims >= 0.5 (0.321)
           optom_presc < 3.5 (0.291)
              total_spend >= 6300 (0.273)
              total_spend < 6300 (0.467)
           optom_presc >= 3.5 (0.59)
              member_duration < 154.5 (0.67)
              member_duration >= 154.5 (0.447)
     no_claims >= 2.5 (0.633)
        no_claims < 4.5 (0.57)
           optom_presc < 3.5 (0.54)
              member_duration >= 128.5 (0.498)
              member_duration < 128.5 (0.627)
           optom_presc >= 3.5 (0.81)
              member_duration >= 137 (0.785)
              member_duration < 137 (0.85)
        no_claims >= 4.5 (0.761)
           member_duration < 303.5 (0.778)
              member_duration >= 148 (0.757)
              member_duration < 148 (0.823)
           [missing one line in the source]
  • 40. Tree representation up to 4 levels, Model M2_LG_TRN_TREES (node probabilities in parentheses; one line missing in the source):
     no_claims < 1.5 (0.164)
        member_duration < 155.5 (0.235)
           optom_presc < 3.5 (0.213)
              no_claims < 0.5 (0.195)
              no_claims >= 0.5 (0.337)
           optom_presc >= 3.5 (0.49)
              optom_presc < 6.5 (0.404)
              optom_presc >= 6.5 (0.647)
        member_duration >= 155.5 (0.111)
           optom_presc < 3.5 (0.103)
              member_duration >= 246.5 (0.065)
              member_duration < 246.5 (0.122)
           optom_presc >= 3.5 (0.235)
              no_claims >= 0.5 (0.353)
              no_claims < 0.5 (0.213)
     no_claims >= 1.5 (0.61)
        no_claims < 2.5 (0.451)
           member_duration < 155.5 (0.562)
              optom_presc >= 1.5 (0.651)
              optom_presc < 1.5 (0.493)
           member_duration >= 155.5 (0.353)
              member_duration >= 237 (0.204)
              member_duration < 237 (0.39)
        no_claims >= 2.5 (0.748)
           no_claims < 4.5 (0.675)
              member_duration >= 236.5 (0.477)
              member_duration < 236.5 (0.721)
           no_claims >= 4.5 (0.899)
              member_duration >= 272 (0.741)
              [missing one line in the source]
  • 41. Comment on Tree Representations. LG starts by splitting on NO_CLAIMS at 2, while GB splits at 3. Predictions at the first level are similar across the two models, (.185; .633) for GB vs. (.164; .61) for LG, which indicates that the structures identified so far are similar (UMI interpretation). At the 2nd level, GB only splits on NO_CLAIMS, while LG splits on MEMBER_DURATION for the suspected non-fraudsters of the first stage and on NO_CLAIMS for the fraudster suspects (BMI already different). Predictions are similar only for the 4th node in level 2 (.748 and .761) but different otherwise. The careful reader may verify that these two predictions emerge by splitting on NO_CLAIMS, albeit at different values, which supports the notion of NO_CLAIMS being the leading clue in our research.
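The slides do not spell out how the 4-level tree representations were built; one common construction (a global surrogate, in the spirit of the LIME slide near the end) is to fit a shallow regression tree to the model's posterior probabilities, sketched here under that assumption:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

def tree_representation(model, X, depth=4):
    """Shallow surrogate tree fit to the black-box model's posterior probabilities."""
    p = model.predict_proba(X)[:, 1]
    surrogate = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X, p)
    return export_text(surrogate, feature_names=list(X.columns))

# print(tree_representation(gb, X_trn))   # splits and leaf means comparable to slide 39
```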
  • 42.
  • 44. Comment on Tree Representations (cont.1) NO_CLAIMS not so heavily used after 2nd level however, and the structures of the models are clearly different. GB does not use it at all, while LG splits at 4.5 to produce the highest prediction level of .899. While GB did split initially on No_claims 2.5 and then on 4.5, it did not reach the same level of prediction as LG that started splitting at 1.5. By going to the marginal effects plot, we can see that No_claims has the largest slope for low values, but member_duration has the highest for highest value of the variable. But no similar plots can be created for GB. Thus, found structures and consequent interpretations differ and there is no isomorphism from one into the other. Perhaps fractal approximation?
  • 45.
  • 46. 2nd most “important variable”, very different structures.
  • 47. 1st most “important” variable, very different structures.
  • 48.
  • 49. Both probs increasing as No_claims increases except for GB between 14 and 17. GB > LG up to NO_CLAIMS = 3, later reverse. Notice increasing difference in prob. after No_CLAIMS = 5.
  • 50. Different binned residual results by model and by variables. Notice LG better fit except between bins 11 through 17.
  • 51. Overall declining relation except for the area around bin 15. Notice no consistently higher level of prob. for either model, contrary to the two previous slides.
  • 52. Both models fit relatively poorly for large values of Member_Duration.
  • 53. Similar gulf in probabilities and severe divergence between 11 and 13. GB makes more extreme jumps.
  • 54. Similar behavior as that of Member_Duration.
  • 55.
  • 56.
  • 57. Strict interpretations: different. Obvious similarities: e.g., high values of NO_CLAIMS are mostly related to high probability values. INTERESTING: the curves are not fully monotonically increasing or decreasing. Residual plots show that there are patterns not well fitted.
  • 58. .... Ranking the models by GOF. Strongly summarized for brevity's sake, added just for completeness.
  • 59. GOF ranks (rank per GOF measure: AUROC, Avg Square Error, Class Rate, Cum Lift 3rd bin, Cum Resp Rate 3rd, Gini, Precision Rate, R-square Cramer-Tjur):
     TRN: 02_M2_TRN_GRAD_BOOSTING      rank 1 on every measure except Class Rate (rank 2); unw. mean 1.13, median 1
          03_M2_TRN_LOGISTIC_STEPWISE  rank 2 on every measure except Class Rate (rank 1); unw. mean 1.88, median 2
     VAL: 04_M2_VAL_GRAD_BOOSTING      rank 1 on every measure except Class Rate (rank 2); unw. mean 1.13, median 1
          05_M2_VAL_LOGISTIC_STEPWISE  rank 2 on every measure except Class Rate (rank 1); unw. mean 1.88, median 2
  • 60. .... Profile and Model Interpretation Area. .... Univariate Profile diagnostics for 6 Important Vars.
  • 61. .... Event Proportions and Posterior Probabilities for 5 Important Vars., by original Model Names. Variables and probabilities are binned for ease of visualization. The proportion of events is the same across models (it's just the original data), but probabilities differ across models. Not all cases shown.
  • 62.
  • 63.
  • 64. Etc for the other variables.
  • 65. Some observations. Binned NO_CLAIMS: while similar in shape, GradBoost seriously underestimates the proportion of events throughout, while logistic has that problem for bins 2, 3, 5, 6, 7. Logistic has a positive slope, while GB flattens due to its interactive nature. Up to bin 7, similar behavior for GB and LG, and then LG jumps to a higher level of probability. Binned MEMBER_DURATION: probability distributions are similar but not identical. For bins 1, 2, 3 and 16 both methods underestimate the proportion of events. Slightly declining slope for both models. Binned OPTOM_PRESC: both methods fail to match the proportion of events in the mid range of the bins. Sudden upshift in positive slope for GB starting at bin 15, while the slope for logistic is overall flat but positive.
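A small sketch of the binned comparison behind these observations (quantile bins and the helper name are assumptions; gb, lg, X_trn, y_trn refer to the earlier fitting sketch):

```python
import pandas as pd

def binned_profile(model, X, y, var, bins=10):
    """Per bin of `var`: actual event proportion vs. the model's mean posterior probability."""
    b = pd.qcut(X[var], q=bins, duplicates="drop")
    prob = pd.Series(model.predict_proba(X)[:, 1], index=X.index)
    return pd.DataFrame({"event_proportion": y.groupby(b).mean(),
                         "mean_posterior_prob": prob.groupby(b).mean()})

# binned_profile(gb, X_trn, y_trn, "NO_CLAIMS") vs. binned_profile(lg, X_trn, y_trn, "NO_CLAIMS")
```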
  • 67. Interpretation: In bin 5, No_claims reaches overall max (100), while for bin 1 max is around 35 and 15 in 0-100 scale for respective models. Same interpretation for Q3, etc.
  • 68.
  • 69.
  • 70.
  • 71. And Conversely ….. (GB = Tree repr. Of Grad_boosting …)
  • 72.
  • 73. .... Partial Dependency Plots and variants for Non Ensemble Models. Some variables may be dropped due to computer resources.
  • 74. Note the narrow range of the GB PDPs compared to those of LG, due to GB's interactive nature → more difficult to interpret.
  • 75.
  • 76. Marginal (1) PDP comparative notes (Marginal (1): one variable at a time; one could also marginalize two variables at a time, not done in this presentation). GB marginals are rather flat, except for MEMBER_DURATION, on which a caveat later on. LG is juicier: the increase in probability for NO_CLAIMS declines along its range, but OPTOM_PRESC's increases, which seems to indicate that the leading reason would be prescriptions and not overall claims. The corresponding marginals for logistic end up with slowing growth due to the logistic shape; GB is not constrained in that way.
  • 77. PDP comparative notes. The Overall PDP is the model probability when all predictors are at their means: for LG it is about .17, while for GB it is .53. Individual PDPs (by definition) are deviations from the Overall when the variable of interest is measured along its range while the others remain at their mean values. GB clumps most PDPs around the Overall; LG shows clearly distinct values instead. The highest probability level for GB is around .7 while LG reaches 1, and the minima are around .6 and 0 respectively. Note LG's monotonicity, while GB is mostly monotonic (except for DOCTOR_VISITS), possibly a product of the artificially created data set. In both cases NO_CLAIMS appears as the leading variable, especially in LG, but while MEMBER_DURATION is rather flat in GB, it declines steadily in LG, with a very different interpretation: longer member duration implies a steadier customer and familiarity. NUM_MEMBERS had been excluded in LG's stepwise and should not be confused with MEMBER_DURATION.
  • 79.
  • 81.
  • 82. UMI: Univariate Model Interpretation. From the preceding pages we can conclude that: NO_CLAIMS is positively associated with increased fraud for both logistic and GB, but with a far steeper slope in logistic. GB stays in a narrow band of probability and is more interactive with other predictors → GB requires more BMI and MMI. GB's PDP overshoots the posterior probability → other variables bring this effect down in GB. MEMBER_DURATION has a U-shaped relationship, especially in the logistic case, while GB's is more spiky. Note the high spike at minimal duration and the immediate decline, which seems to indicate members who committed fraud as soon as they joined and then left immediately.
  • 83. UMI: Univariate Model Interpretation (cont. 1) PDP view: logistic shows positive effects of NO_CLAIMS and OPTOM_PRESC, balanced by negative effects of remaining variables. Comparing posterior probability with No_claims PDP, they are almost the same for Logistic. Similarly for MEMBER_DURATION. Grad_b instead shows more tepid effects of same variables, and almost unchanging effects of remaining predictors. Comparing PDP with probability, other predictors bring down PDP of No_claims. Similar effects for MEMBER_DURATION.
  • 84. .... PDPs for "Pairs of Variables". Note: 3-D plots tend to interpolate areas with no data, producing false expectations of results; thus, binned charts are sometimes preferable to interpolated 3-D plots. Not all pairs of variables are available, due to computer resources.
  • 85. Same for LG. Note that the correlations of NO_CLAIMS with the other variables are relatively small when compared to the pairs MEMBER_DURATION – DOCTOR_VISITS and MEMBER_DURATION – OPTOM_PRESC. How will this translate into PDPs for two variables at a time?
  • 86. LG PDP, NO_CLAIMS × TOTAL_SPEND (M2_TRN_LOGISTIC_STEPWISE, corr = -0.025): High levels of NO_CLAIMS have high probability at the lowest level of TOTAL_SPEND, which probably denotes one-time fraud. Otherwise, even mid levels of NO_CLAIMS are associated with high probability for any level of TOTAL_SPEND. It seems that FRAUD is not necessarily linked to TOTAL_SPEND alone.
  • 87. GB binned PDP, NO_CLAIMS × DOCTOR_VISITS (M2_TRN_GRAD_BOOSTING, corr = 0.051): The combination of NO_CLAIMS and DOCTOR_VISITS shows high probability at the NE corner, and the middle section has a stable high probability level. Too many charts to show, but necessary for full interpretation.
  • 88. GB binned PDP, NO_CLAIMS × MEMBER_DURATION (M2_TRN_GRAD_BOOSTING, corr = 0.025): Regarding NO_CLAIMS and MEMBER_DURATION, fraud happens at low levels of duration, after which the fraudsters leave.
  • 89. GB PDP, NO_CLAIMS × OPTOM_PRESC (M2_TRN_GRAD_BOOSTING, corr = 0.066): Probability growth for low NO_CLAIMS as OPTOM_PRESC increases (not as dramatic as in LG), and then slow and steady probability growth with increases in both variables.
  • 90. LG binned PDP, NO_CLAIMS × OPTOM_PRESC (M2_TRN_LOGISTIC_STEPWISE, corr = 0.066): Steep probability growth for low NO_CLAIMS as OPTOM_PRESC increases; similar but less pronounced growth for small OPTOM_PRESC and increasing NO_CLAIMS.
  • 91. GB binned PDP, MEMBER_DURATION × OPTOM_PRESC (M2_TRN_GRAD_BOOSTING, corr = -0.108): For the pair OPTOM_PRESC and MEMBER_DURATION, for which we have contrasting pair PDPs (corr ≈ -0.10), the interpretation is very different: while GB shows flat probabilities throughout, except in the empty NE corner, logistic (next slide) shows a more extreme NE corner plus probabilities declining from the NW top.
  • 92. LG binned PDP, MEMBER_DURATION × OPTOM_PRESC (M2_TRN_LOGISTIC_STEPWISE, corr = -0.108); see the comparison on the previous slide.
  • 93. GB PDP (M2_TRN_GRAD_BOOSTING, corr = -0.025): Rather flat relationship in GB, but steeper in LG (next slide).
  • 95. Some BMI comments. The previous charts suggest that one type of fraud happens once, at low levels of duration. Different types of fraud are linked to prescriptions and claims: for low levels of claims, increasing prescriptions leads to fraud, as does a combination of increasing claims and prescriptions. So it is possible that at least three types of fraud are being committed. Other pair combinations were not interesting, with rather flat surfaces.
  • 96. For 2-dimensional visualization, collapse the 3-D chart by averaging the levels of variable 2 into those of variable 1 and compare to the original PDP. Original and collapsed PDPs (CPDPs) are derived from the posterior model probabilities.
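A hedged sketch of the collapsing step (grid sizes and the at-the-means convention follow the earlier PDP sketch, not necessarily the presenter's exact computation):

```python
import numpy as np
import pandas as pd

def collapsed_pdp(model, X, var1, var2, grid_size=20):
    """Average the two-variable PDP surface over var2, yielding a curve in var1 (a CPDP)."""
    g1 = np.linspace(X[var1].min(), X[var1].max(), grid_size)
    g2 = np.linspace(X[var2].min(), X[var2].max(), grid_size)
    rows = pd.concat([X.mean().to_frame().T] * (grid_size * grid_size), ignore_index=True)
    m1, m2 = np.meshgrid(g1, g2, indexing="ij")
    rows[var1], rows[var2] = m1.ravel(), m2.ravel()
    surface = model.predict_proba(rows[X.columns])[:, 1].reshape(grid_size, grid_size)
    return g1, surface.mean(axis=1)   # overlay this curve on the one-variable PDP of var1

# e.g. collapsed_pdp(gb, X_trn, "NO_CLAIMS", "MEMBER_DURATION")
```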
  • 97. No room for TOTAL_SPEND
  • 98.
  • 99.
  • 100. Comments for BMI. In the case of NO_CLAIMS, all cases show overlap of the collapsed and original PDPs, except for OPTOM_PRESC. In the GB case, MEMBER_DURATION brings the probability down slightly because duration is itself a strong predictor. LG shows that the presence of OPTOM_PRESC raises the posterior probability, not so accentuated in GB. The LG model could benefit from a NO_CLAIMS and OPTOM_PRESC interaction, or possibly an overall transformation obtained per month and per number of members. (The LG chart with TOTAL_SPEND is omitted for brevity.) MEMBER_DURATION shows overlap with all second variables, plus a declining slope, more evident in the LG models.
  • 101. Comments for BMI (cont.). It is possible to obtain 3-way and higher PDPs, and also to collapse them; not tried here. Given the overlap between the original PDP and the CPDP, the UMI effects are correct so far, except possibly for the triplet NO_CLAIMS, MEMBER_DURATION and OPTOM_PRESC.
  • 102.
  • 103.
  • 104.
  • 105. Collapsed Triad and Tetrad PDPs. In like manner to 3-D PDPs, it is possible to obtain PDPs for 3 or more variables, but not to graph them (at least, not yet). Collapsed triads bypass the problem by obtaining the PDP of a specific three-variable set (a triad); tetrads do the same for four-variable sets. While it is possible to collapse two of them and present the average PDP compared to the univariate PDPs, the bivariate PDP and the original probability, it is also possible to collapse the mean PDP along the 4 quartile ranges of the third variable (in the TRIAD case) or the fourth variable (in the TETRAD case); we bypass the quartile presentation for brevity. In addition to mean PDPs, we also include max and min PDPs as an overlay to provide a view of the probability variation. Still working on improving these presentations.
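A rough sketch of a collapsed triad with the mean/min/max overlays described above (grid sizes are guesses, and this is again not necessarily the presenter's exact computation):

```python
import itertools
import numpy as np
import pandas as pd

def collapsed_triad(model, X, triad, grid_size=10):
    """Per level of triad[0]: mean, min and max of the three-variable PDP (others at means)."""
    grids = [np.linspace(X[v].min(), X[v].max(), grid_size) for v in triad]
    combos = pd.DataFrame(list(itertools.product(*grids)), columns=triad)
    rows = pd.concat([X.mean().to_frame().T] * len(combos), ignore_index=True)
    rows[triad] = combos.values
    p = model.predict_proba(rows[X.columns])[:, 1]
    return (pd.DataFrame({triad[0]: combos[triad[0]], "pdp": p})
              .groupby(triad[0])["pdp"].agg(["mean", "min", "max"]))

# collapsed_triad(gb, X_trn, ["NO_CLAIMS", "MEMBER_DURATION", "OPTOM_PRESC"])
```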
  • 106. Original probability not truly represented well by any combination of PDPs in GB case. GB searches solution over residuals, not fully represented hereby.
  • 107. Minor differences when compared to previous TRIAD.
  • 108. Original probability represented very well with 3 variables
  • 109. Minor differences between triads and tetrad representations.
  • 111. Same as previous slide but in difference form.
  • 112. The residual-iterative nature of GB shows that just 3 variables do not capture the full interpretation, since the posterior probability remains below and above most of the data points.
  • 113. Same as previous slide in difference form.
  • 114. Comment on Triads and Tetrads. For the present data set and choice of represented variables, there was not much difference between TRIADS and TETRADS, which means that omitting the fourth variable does not alter our results by much. Still, we emphasize that it is quite illusory to concentrate merely on average values: the graphs also contain max and min probability values, which show high variability except for large values of NO_CLAIMS. This is clear since the event rate was 20% and concentrated on high values of that variable.
  • 115.
  • 116. Lowest prob level represented by lowest levels of predictors, except for member duration.
  • 117. Highest prob levels determined by mid-levels of No_claims and higher total spend than in bin 1..
  • 118. Conclusions on the multivariate distribution of predictors along probability bins. Since the event rate is 20%, a good model should clump those observations in the higher probability bins. In the lowest bin, predictor bins are typically the lowest ones, except for MEMBER_DURATION. Again, this seems to indicate the presence of fraudsters with short duration who leave immediately, as evinced in the bin-5 probability slide, where most predictors have observations from higher bins. Other probability bins are not listed for brevity's sake.
  • 119.
  • 120. Similar coefficients among logistic, GB and beta regressions. With beta regression, the standard interpretation of log odds is possible, with caveats. Vars * Models * Coeffs (GB columns show importance, the remaining columns show coefficients):

     Variable           M2_TRN_GB   M2_TRN_GB_BETA_REG   M2_TRN_LOGISTIC   M2_TRN_LOGISTIC_BETA_REG   M2_VAL_GB   M2_VAL_LOGISTIC
     MEMBER_DURATION    0.7318      -0.0051              -0.0057           -0.0057                    0.7318      -0.0084
     DOCTOR_VISITS      0.3925      -0.0061               .                 .                         0.3925       .
     TOTAL_SPEND        0.6610      -0.0000              -0.0000           -0.0000                    0.6610      -0.0000
     OPTOM_PRESC        0.5944       0.1713               0.2132            0.2132                    0.5944       0.1634
     NO_CLAIMS          1.0000       0.7027               0.7921            0.7921                    1.0000       0.7351
     SCALE               .          21.8895                .               2.590291516E16              .           .
     INTERCEPT           .          -0.7979              -0.8352           -0.8352                     .          -0.3111
  • 121. Beta regression results. Analyzing the posterior probabilities (i.e., the original GB and LG posteriors) via beta regression shows very similar coefficients and structures → beta regression is reassuring but does not provide additional information.
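For completeness, a minimal sketch of a beta regression fit to a model's posterior probabilities (logit mean link, constant precision, maximum likelihood via scipy; the clipping away from 0/1 is an assumption for numerical safety, and this is not necessarily the software used for the slides):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

def beta_regression(X, p, eps=1e-4):
    """MLE beta regression of probabilities p on X: logit mean link, constant precision phi."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    Z = np.column_stack([np.ones(len(p)), np.asarray(X, dtype=float)])

    def negloglik(theta):
        beta, phi = theta[:-1], np.exp(theta[-1])
        mu = expit(Z @ beta)
        a, b = mu * phi, (1 - mu) * phi
        return -np.sum(gammaln(phi) - gammaln(a) - gammaln(b)
                       + (a - 1) * np.log(p) + (b - 1) * np.log(1 - p))

    res = minimize(negloglik, np.zeros(Z.shape[1] + 1), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])   # (intercept + coefficients, precision)

# e.g. coefs, phi = beta_regression(X_trn, gb.predict_proba(X_trn)[:, 1])
```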
  • 122.
  • 123. Possible to say that … 1) Manipulate just NO_CLAIMS and the problem is solved? 2) Maybe add MEMBER_DURATION and OPTOM_PRESC for parts of the NO_CLAIMS range? 3) Maybe add if-then rules from the simplified TREE_REPRESENTATION, because it is easier than GB and more interactive than LG? 4) If using a Neural Network and NN derivatives, abandon all hope of interpretation? 5) → Interpretation needs a definition of the INTENDED AUDIENCE (see Tolstoy ut supra).
  • 124. Possible to say that … (cont. 1) 1) The analyst needs to focus on NO_CLAIMS, MEMBER_DURATION and OPTOM_PRESC as an ‘IMPORTANT’ group. 2) Different model Interpretations should be entertained. 3) Different marginal effects must be explained.
  • 125. Final thoughts … MI analysis must proceed further for large numbers of predictors, obtaining insights from collapsing 3+ way PDPs, for instance. If an 'easier' linear-model explanation is preferred, beta regression on the posterior probability would provide regression-like information. Still, beta regression is not straightforward and model selection is a big issue.
  • 126.
  • 127. Future steps. Focus on MMI: 1) Beta regression for 'linear' interpretation. More difficult because it requires model search as well, plus the additional error from modeling the posterior probability of the original model. 2) Andrews' curves.
  • 128.
  • 129. LIME: Local Interpretable Model-Agnostic Explanations: uses a surrogate interpretable model on the black-box model, applied to observations of interest. The tree representation in this presentation is similar to this. (https://homes.cs.washington.edu/~marcotcr/blog/lime/) ICE: clusters or a classification variable applied to PDP results. For a given predictor, ICE plots draw one line per observation, representing how the instance's prediction changes when the predictor changes. SHAP values: Shapley Additive Explanations (Lundberg et al., 2017): measure the +/- feature contribution to the probability; the technique is used in game theory to determine each player's contribution to the success of a game. Affected by correlations among predictors → focusing on just one predictor to change behavior may change other predictors as well (available in Python). AND OTHERS ….
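A minimal ICE sketch matching the description above (the sample size and helper name are assumptions; gb and X_trn refer to the earlier fitting sketch):

```python
import numpy as np
import pandas as pd

def ice_curves(model, X, var, grid_size=20, n_obs=50, seed=0):
    """One curve per sampled observation: vary `var` over its range, keep the rest fixed."""
    grid = np.linspace(X[var].min(), X[var].max(), grid_size)
    sample = X.sample(n=min(n_obs, len(X)), random_state=seed)
    curves = []
    for _, row in sample.iterrows():
        rows = pd.DataFrame([row] * grid_size)
        rows[var] = grid
        curves.append(model.predict_proba(rows[X.columns])[:, 1])
    return grid, np.array(curves)   # averaging the curves column-wise gives the data-averaged PDP

# grid, curves = ice_curves(gb, X_trn, "NO_CLAIMS")
```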
  • 130. References
     Lundberg S.M., Lee S.I. (2017), "Consistent feature attribution for tree ensembles", presented at the 2017 ICML Workshop on Human Interpretability in Machine Learning (WHI 2017), Sydney, NSW, Australia. https://arxiv.org/abs/1706.06060
     Molnar C. (2018), Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/
     Tolstoy L. (1894), The Kingdom of God Is Within You.