• Areas to improve
• Principal components main corr part, difficult
to read. Improve on this
• Explain how we got plot of overall prob
against single variable.
Abstract
Statistical and data science models are considered to be, somewhat
pejoratively, black-boxes, interpretation of which has not been
systematically studied.
Molnar’s “Interpretable Machine Learning” is a big effort in finding
solutions. Our presentation is humbler. We aim at presenting visual tools
for model interpretation based on partial dependency plots and their
variants, such as collapsed PDPs created by the presenter, some of
which may be polemical and debatable.
The audience should be versed in model creation, with at
least some insight into partial dependency plots. The presentation is based on a
simple working example with 6 predictors and one binary target variable
for ease of exposition.
Not possible to detail exhaustively every method described in this
presentation. Extensive document in preparation.
Slides marked *** can be skipped for an easier first reading.
Overall comments and introduction.
Presentation by way of example focusing on Fraud/Default
Data sets and continuing previous chapters.
Aim: study interpretation, diagnosis mostly via Partial
Dependency Plots of logistic regression, Classification
Trees and Regression Boosting.
At present, lots of written opinions and distinctions about
topic. No time to discuss them all. See Molnar’s (2018)
recent book for an overall view.
No discussion about imbalanced data set modeling or
other modeling issues. No discussion on literature, all due
to time constraints.
Interpretation and explanation from whom to whom? ***
Models typically involve multivariate relationships, usually displayed,
summarized and/or measured by specialized statistics. Either
graphically or as tables/formulae.
Graphical representations are easier to “interpret” and “understand”
but not necessarily fully interpretable.
Plus, possible existence of subgroups in data imply larger
interpretations and need of inter group comparisons that can test
patience of ‘to whom’ audience.
Finally, models created by software, not ‘by hand’. Software does not
explain or interpret, it ‘fits’ following computing or algebraic algorithms.
Thus, don’t blame software for poor interpretations.
More importantly, there are EU regulations at present to explain models
under the GDPR “right to explanation” (General Data Protection
Regulation).
Interpretation in context of pre-conceptions:
‘it makes-sense’ . ***
In Middle Ages, world was deemed to be flat, and it ‘made-sense’ .
Pre-conceptions and lack of analytical insight can seriously undermine, if not
mislead, model interpretation.
‘Making sense’ and ‘rules of thumb’, usually based on univariate or at most
bivariate (sometimes causal) relationships (or just convenience), bias
understanding and model creation and selection.
Model Interpretation should not be understood as ‘dumbing down’ of complex
multi-variate relationships but of clear exposition of conditions that lead to an
event, however difficult it may be to disentangle multivariate conditions.
On other hand, modeler should not take refuge in arcane formulae to
hide his/her own superficial understanding of conditions unveiled by
model. Otherwise, the action is called deceit.
“The most difficult subjects
can be explained to the most
slow-witted man if he has not
formed any idea of them
already; but the simplest thing
cannot be made clear to the
most intelligent man if he is
firmly persuaded that he knows
already, without a shadow of
doubt, what is laid before him.”
Leo Tolstoy, “The Kingdom of
God Is Within You” (1894)
Interpretation in context of many competing
models. ***
It is desirable that ONE and just ONE interpretation be the final outcome
of a model search. But just like in criminal detection there may be many
suspects with different or similar motivations, different models may
sometimes be interpreted similarly but often, interpretations are vastly
different.
Should model interpretation determine or condition model selection?
Practice: model creation and selection come prior to model
interpretation, assuming competent model creation and interpretation.
Personal preference: If model non-interpretable, NOT A GOOD MODEL.
Classical statisticians however neglected variable selection and ensuing
model uncertainty.
Model Interpretation categorization.
Just as in EDA (but on model results, not on initial data), three
types:
Univariate Model Interpretation (UMI): One variable at a
time. EASIEST to understand and a huge source of “makes
sense”. E.g., Classical linear models interpretations. E.g.,
reasons to decline a loan.
Bivariate Model Interpretation (BMI): Looking at pairs of
variables to interpret model results.
Multivariate Model Interpretation (MMI): Overall model
interpretation, the most difficult.
Typically, most work results in UMI and perhaps BMI.
Days of Linear Regression Interpretation ***
Based on the “ceteris paribus” assumption, which fails in case of
even relatively small VIFs. At present, rule of thumb: VIF >=
10 (R-sq = .90 among predictors) → unstable model.
“Ceteris paribus” exercise: keeping all other predictors
constant, an increase in …. But if R-sq among predictors is
even 10%, it is not possible to keep all predictors constant while
increasing the variable of interest by 1.
Advantage: EASY to conceptualize because the practice
follows the notion of bivariate correlation.
But that notion is generally wrong in the multivariate case.
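The VIF rule of thumb above can be checked numerically. A minimal sketch on toy data (not the presentation's data set; coefficients illustrative): with only two predictors, the R-squared of one regressed on the other is their squared correlation, so VIF = 1/(1 − r²).

```python
# Variance Inflation Factor (VIF) sketch on toy data.  With two predictors,
# the R-squared of one regressed on the other is their squared correlation.
import random

random.seed(1)
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
# z is correlated with x by construction (illustrative coefficients)
z = [0.9 * xi + 0.45 * random.gauss(0, 1) for xi in x]

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

r = corr(x, z)
vif = 1.0 / (1.0 - r ** 2)
print(round(r, 3), round(vif, 2))
# VIF at exactly R-sq = .90: the rule-of-thumb threshold of the slide.
print(1.0 / (1.0 - 0.90))
```

Even a moderate correlation among predictors pushes the VIF well above 1, which is the slide's point: "all else held constant" stops being a meaningful exercise.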
In simple regression, Corr(X,Y) equals the slope when SD(Y) = SD(X), e.g., if both
variables are standardized; otherwise the two at least share the same sign, and
the interpretation from correlation holds in the simple regression
case.
Notice that the regression of X on Y is NOT the inverse of the
regression of Y on X, because of SD(X) and SD(Y).
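The slope/correlation identity behind the "same sign" remark can be sketched on toy data (all numbers illustrative): b_YX = r·s_Y/s_X, so the slope and the correlation always share a sign, and the X-on-Y slope is not the reciprocal of the Y-on-X slope unless s_X = s_Y.

```python
# Sketch (toy data): slope of Y on X is r * s_y / s_x; the two regressions
# are not inverses of each other because the SDs enter differently.
import random

random.seed(7)
n = 500
x = [random.gauss(0, 2) for _ in range(n)]
y = [1.5 * xi + random.gauss(0, 3) for xi in x]   # illustrative relationship

def mean(v): return sum(v) / len(v)
def sd(v):
    m = mean(v)
    return (sum((vi - m) ** 2 for vi in v) / (len(v) - 1)) ** 0.5
def corr(a, b):
    ma, mb = mean(a), mean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return num / ((len(a) - 1) * sd(a) * sd(b))

r = corr(x, y)
b_yx = r * sd(y) / sd(x)          # slope of Y on X
b_xy = r * sd(x) / sd(y)          # slope of X on Y
print(b_yx, b_xy, 1 / b_yx)       # b_xy != 1 / b_yx in general
```

Note that b_yx · b_xy = r², so the product of the two slopes equals 1 only when the correlation is perfect.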
Confusion on signs of coefficients
and interpretation.
In simple linear regression the fitted slope, the correlation and R-square are tied together:

\hat{\beta}_{YX} = r_{xy}\,\frac{s_y}{s_x},\qquad
\frac{\hat{Y}_i-\bar{Y}}{s_y} = r_{xy}\,\frac{X_i-\bar{X}}{s_x},\qquad
\mathrm{sg}(\hat{\beta}_{YX}) = \mathrm{sg}(r_{xy}),\qquad
R^2 = r_{xy}^{2}.

2019-05-10
In multiple linear regression, the previous relationship does
not hold because predictors can be correlated (r_{XZ}),
weighted by r_{YZ}, hinting at co-linearity and/or relationships
of suppression/enhancement →
In the multivariate case, e.g. Y = f(X, Z), the estimated equation
(emphasizing “partial”) is

\hat{Y} = \hat{a} + \hat{\beta}_{YX\cdot Z}\,X + \hat{\beta}_{YZ\cdot X}\,Z,

and, for example,

\hat{\beta}_{YX\cdot Z} = \frac{s_Y}{s_X}\cdot
\frac{r_{YX} - r_{YZ}\,r_{XZ}}{1 - r_{XZ}^{2}},
\qquad
\mathrm{sg}(\hat{\beta}_{YX\cdot Z}) = \mathrm{sg}(r_{YX} - r_{YZ}\,r_{XZ}),

which differs from sg(r_{YX}) whenever abs(r_{YX}) < abs(r_{YZ} r_{XZ}) and the
two terms share the same sign.
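The partial-slope formula above can be verified numerically on toy data (all numbers illustrative). The example is built so that the marginal correlation r_YX and the partial slope of X have opposite signs, i.e. suppression/enhancement:

```python
# Sketch (toy data): the partial slope of X in Y ~ X + Z, reconstructed from
# pairwise correlations, can carry the opposite sign from r_YX when
# r_YZ * r_XZ is large enough (suppression/enhancement).
import random

random.seed(3)
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]
x = [0.8 * zi + 0.6 * random.gauss(0, 1) for zi in z]
y = [1.0 * zi - 0.3 * xi + 0.5 * random.gauss(0, 1) for zi, xi in zip(z, x)]

def mean(v): return sum(v) / len(v)
def sd(v):
    m = mean(v)
    return (sum((t - m) ** 2 for t in v) / (len(v) - 1)) ** 0.5
def corr(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / ((len(a) - 1) * sd(a) * sd(b))

r_yx, r_yz, r_xz = corr(y, x), corr(y, z), corr(x, z)
# partial slope of X from the correlation formula
beta_yx_z = (r_yx - r_yz * r_xz) / (1 - r_xz ** 2) * sd(y) / sd(x)
print(r_yx, beta_yx_z)   # marginal correlation positive, partial slope negative
```

The marginal correlation of Y and X is positive (both ride on Z), yet the partial slope recovers the negative −0.3 used to generate the data: exactly the sign confusion the slide warns about.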
Comment
Even in traditional UMI land, we find that
multivariate relations given by Partial- and semi-
partial correlations must be part of the
interpretation.
Note that while correlation is a bivariate
relationship, partial and semipartial corrs can be
extended to multivariate setting.
However, even BMI and certainly MMI not so often
performed.
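The partial correlations mentioned above can be sketched in two equivalent ways: the closed-form correlation formula, and correlating OLS residuals after removing Z from both Y and X. Toy data, illustrative only:

```python
# Sketch: partial correlation of Y and X controlling for Z, computed two
# equivalent ways -- the correlation formula and correlating OLS residuals.
import random

random.seed(11)
n = 1500
z = [random.gauss(0, 1) for _ in range(n)]
x = [0.7 * zi + random.gauss(0, 1) for zi in z]
y = [0.7 * zi + random.gauss(0, 1) for zi in z]   # Y, X related only via Z

def mean(v): return sum(v) / len(v)
def corr(a, b):
    ma, mb = mean(a), mean(b)
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    da = sum((p - ma) ** 2 for p in a)
    db = sum((q - mb) ** 2 for q in b)
    return num / (da * db) ** 0.5

def residuals(v, w):             # residuals of v regressed on w (with intercept)
    mv, mw = mean(v), mean(w)
    b = sum((p - mv) * (q - mw) for p, q in zip(v, w)) / sum((q - mw) ** 2 for q in w)
    a = mv - b * mw
    return [p - (a + b * q) for p, q in zip(v, w)]

r_yx, r_yz, r_xz = corr(y, x), corr(y, z), corr(x, z)
pc_formula = (r_yx - r_yz * r_xz) / (((1 - r_yz ** 2) * (1 - r_xz ** 2)) ** 0.5)
pc_resid = corr(residuals(y, z), residuals(x, z))
print(pc_formula, pc_resid)      # the two routes agree
```

Here the marginal correlation of Y and X is clearly positive, but the partial correlation (controlling for Z) is near zero — the bivariate "makes sense" reading disappears once the shared driver is removed.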
Searching for Important variables en route to answering
modeling question.
Case study: minimum components to make a car go
along highway.
1) Engine
2) Tires
3) Steering wheel
4) Transmission
5) Gas
6) ….. Other MMI aspects and interrelations.
Take just one of them out, and the car won’t drive. There is no
SINGLE most important variable but a minimum irreducible set of
them. In the Data Science case with n → ∞, possibly many subsets of
‘important’ variables.
But “suspect VARS” are a good starting point of research.
Model M2 — Item Information
  TRN data set: train       TRN num obs: 3595    TRN % events: 20.389
  VAL data set: validata    VAL num obs: 2365    VAL % events: 19.281
  TST data set: (none)      TST num obs: 0       TST % events: (none)
  Dep. var: fraud
Original Vars + Labels, Model M2
  DOCTOR_VISITS     Total visits to a doctor
  MEMBER_DURATION   Membership duration
  NO_CLAIMS         No of claims made recently
  NUM_MEMBERS       Number of members covered
  OPTOM_PRESC       Number of opticals claimed
  TOTAL_SPEND       Total spent on opticals
Requested Models: Names & Descriptions.
Full Model Name Model Description
Overall Models
M2 20 pct prior
M2_BY_DEPVAR Inference
01_M2_GB_TRN_TREES Tree Repr. for Gradient Boosting
02_M2_TRN_GRAD_BOOSTING Gradient Boosting
03_M2_TRN_LOGISTIC_STEPWISE Logistic TRN STEPWISE
04_M2_VAL_GRAD_BOOSTING Gradient Boosting
05_M2_VAL_LOGISTIC_STEPWISE Logistic VAL STEPWISE
Data set: Definition by way of Example
• Health insurance company:
Ophthalmologic Insurance Claims
• Is claim valid or fraudulent? Binary
target.
• No transformations created to have
simple data set.
• Full description and analysis of this data
set in
https://www.slideshare.net/LeonardoAuslender
(lectures at Principal Analytics Prep).
Ch. 1.1-27
Alphabetic List of Variables and Attributes
  #  Variable          Type  Len  Format   Informat  Label
  3  DOCTOR_VISITS     Num   8    BEST12.  F12.      Total visits to a doctor
  1  FRAUD             Num   8    BEST12.  F12.      Fraudulent Activity yes/no
  5  MEMBER_DURATION   Num   8                       Membership duration
  4  NO_CLAIMS         Num   8    BEST12.  F12.      No of claims made recently
  7  NUM_MEMBERS       Num   8                       Number of members covered
  6  OPTOM_PRESC       Num   8    BEST12.  F12.      Number of opticals claimed
  2  TOTAL_SPEND       Num   8    BEST12.  F12.      Total spent on opticals
Note: No nominal predictors. No transformations, to keep the presentation simple
but not simpler than necessary.
....
Reporting area for
all models’
coefficients,
importance, etc., and
selected variables.
Vars * Models * Coeffs
(Gradient Boosting columns: Importance / Nrules; Logistic columns: Coeff / P-val)

Variable          M2_TRN_GRAD_BOOSTING  M2_TRN_LOGISTIC_STEPWISE  M2_VAL_GRAD_BOOSTING  M2_VAL_LOGISTIC_STEPWISE
NUM_MEMBERS       0.1099 / 2            .                         0.1099 / 2            .
OPTOM_PRESC       0.6211 / 19           0.2178 / 0.000            0.6211 / 19           0.1463 / 0.000
DOCTOR_VISITS     0.4434 / 20           -0.0171 / 0.020           0.4434 / 20           -0.0065 / 0.428
MEMBER_DURATION   0.7843 / 41           -0.0066 / 0.000           0.7843 / 41           -0.0065 / 0.000
TOTAL_SPEND       0.6864 / 29           -0.0000 / 0.003           0.6864 / 29           -0.0000 / 0.004
NO_CLAIMS         1.0000 / 19           0.7752 / 0.000            1.0000 / 19           0.7610 / 0.000
INTERCEPT         .                     -0.5767 / 0.000           .                     -0.5635 / 0.001
Logistic Selection Steps, Model M2_TRN_LOGISTIC_STEPWISE
Step  Effect Entered   Effect Removed  # in model  P-value
1     no_claims                        1           .00
2     member_duration                  2           .00
3     optom_presc                      3           .00
4     total_spend                      4           .00
5     doctor_visits                    5           .02
Dropped NUM_MEMBERS.
Mg (marginal) effect: change in probability as X changes.
Some conclusions and comments so far:
. Logistic stepwise dropped Num_members that is shown
with lowest relative importance in GB. Notice that Logistic
Regression does not have agreed-upon scale of importance.
We can use odds-ratios, e.g.
. NO_CLAIMS is deemed most important single variable for
GB, but logistic deems OPTOM_PRESC as the second one
(via odds ratios), while GB selected MEMBER_DURATION.
. Remaining variables have odds ratios of 1 which seem to
indicate similar effect, while GB distinguishes relative
importance after first two variables.
LG does not have a measure of importance, as GB does →
we use marginal effects plots that indicate the change in
probability along each variable’s range. Except for Member
duration (which declines initially), the other effects are
positive with different intensities, and the maximum value declines
as per the logistic shape. Member duration has a pronounced
decline for low duration levels → possibility of fraudulent
members who join, commit their fraud and leave.
Note the sharper increase in prob. for no_claims at bins 1 and
8, and for optom_presc at 6.
GB importance measures the impact of individual inputs on
predicting Y, but does not tell how the impact changes along the
range of the inputs, and individual variable effects are not taken into
consideration
→ use Partial Dependency Plots, also for LG as a free ride.
Marginal Effects and PDPs
Marginal effects refer to change in probability with one
unit change in X, ceteris paribus (if meaningful or at
least desirable).
PDPs do not indicate change in Y at all; instead, PDP
measures probability levels at different values of X1
measuring all other predictors at their means (or modes,
medians, etc.).
→ No marginality in PDPs, unless we also measure ‘change’
in probability. Shown later on, called Marginal
PDPs.
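A minimal sketch of a one-variable PDP as defined above — the variable of interest swept over a grid while all other predictors are held at their means — on a hypothetical stand-in model (the coefficients are invented, not the presentation's fitted model). The successive differences give the "Marginal PDP" of the text.

```python
# One-variable PDP sketch: sweep x1 over a grid, hold the other predictors
# at their means, and read off the model's probability level at each point.
import math, random

random.seed(5)
n = 400
X = [[random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]

def model_prob(row):             # stand-in fitted model (hypothetical coefficients)
    s = -0.5 + 1.2 * row[0] - 0.8 * row[1] + 0.3 * row[2]
    return 1.0 / (1.0 + math.exp(-s))

means = [sum(r[j] for r in X) / n for j in range(3)]
grid = [-2 + 0.5 * k for k in range(9)]          # grid over x1's range

pdp = []
for g in grid:                   # PDP as defined in the text: others at means
    row = means[:]
    row[0] = g
    pdp.append(model_prob(row))

# "Marginal PDP": successive differences, i.e. the CHANGE in probability
# along the grid rather than the level itself.
mpdp = [b - a for a, b in zip(pdp, pdp[1:])]
print([round(p, 3) for p in pdp])
```

The PDP reports probability levels; only the differenced curve speaks about change, which is exactly the distinction the slide draws between PDPs and marginal effects.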
Bù shì de (Chinese: ‘no, it isn’t’).
Maybe?
Tree representation(s) up to 4 levels, Model M2_GB_TRN_TREES
(node probabilities in parentheses; leaf predictions at right)

no_claims < 2.5 (0.185)
    no_claims < 0.5 (0.159)
        member_duration < 180.5 (0.199)
            total_spend < 5250 (0.464) ................ 0.464
            total_spend >= 5250 (0.186) ............... 0.186
        member_duration >= 180.5 (0.103)
            doctor_visits >= 5.5 (0.093) .............. 0.093
            doctor_visits < 5.5 (0.126) ............... 0.126
    no_claims >= 0.5 (0.321)
        optom_presc < 3.5 (0.291)
            total_spend >= 6300 (0.273) ............... 0.273
            total_spend < 6300 (0.467) ................ 0.467
        optom_presc >= 3.5 (0.59)
            member_duration < 154.5 (0.67) ............ 0.670
            member_duration >= 154.5 (0.447) .......... 0.447
no_claims >= 2.5 (0.633)
    no_claims < 4.5 (0.57)
        optom_presc < 3.5 (0.54)
            member_duration >= 128.5 (0.498) .......... 0.498
            member_duration < 128.5 (0.627) ........... 0.627
        optom_presc >= 3.5 (0.81)
            member_duration >= 137 (0.785) ............ 0.785
            member_duration < 137 (0.85) .............. 0.850
    no_claims >= 4.5 (0.761)
        member_duration < 303.5 (0.778)
            member_duration >= 148 (0.757) ............ 0.757
            member_duration < 148 (0.823) ............. 0.823
Missing one line.
Tree representation(s) up to 4 levels, Model M2_LG_TRN_TREES
(node probabilities in parentheses; leaf predictions at right)

no_claims < 1.5 (0.164)
    member_duration < 155.5 (0.235)
        optom_presc < 3.5 (0.213)
            no_claims < 0.5 (0.195) ................... 0.195
            no_claims >= 0.5 (0.337) .................. 0.337
        optom_presc >= 3.5 (0.49)
            optom_presc < 6.5 (0.404) ................. 0.404
            optom_presc >= 6.5 (0.647) ................ 0.647
    member_duration >= 155.5 (0.111)
        optom_presc < 3.5 (0.103)
            member_duration >= 246.5 (0.065) .......... 0.065
            member_duration < 246.5 (0.122) ........... 0.122
        optom_presc >= 3.5 (0.235)
            no_claims >= 0.5 (0.353) .................. 0.353
            no_claims < 0.5 (0.213) ................... 0.213
no_claims >= 1.5 (0.61)
    no_claims < 2.5 (0.451)
        member_duration < 155.5 (0.562)
            optom_presc >= 1.5 (0.651) ................ 0.651
            optom_presc < 1.5 (0.493) ................. 0.493
        member_duration >= 155.5 (0.353)
            member_duration >= 237 (0.204) ............ 0.204
            member_duration < 237 (0.39) .............. 0.390
    no_claims >= 2.5 (0.748)
        no_claims < 4.5 (0.675)
            member_duration >= 236.5 (0.477) .......... 0.477
            member_duration < 236.5 (0.721) ........... 0.721
        no_claims >= 4.5 (0.899)
            member_duration >= 272 (0.741) ............ 0.741
Missing one line.
Comment
LG starts by splitting on NO_CLAIMS >= 2 as likely fraud,
while GB >= 3. Predictions for first level across 2 models are
similar ( .185 ; .633) for GB vs. (.164, .61) for LG, which
indicates that the structures identified so far are similar.
While in 2nd level, GB only splits on NO_CLAIMS, LG splits
on MEMBER_DURATION for suspected non-fraudsters in the
first stage and on NO_CLAIMS for the fraudster suspects.
Predictions are similar only for the 4th node in level 2 (.748
and .761) but different otherwise. The careful reader may
verify that these two predictions emerge by splitting on
no_claims albeit at different values, which supports the
notion of No_claims being the leading clue in our research.
NO_CLAIMS is not so heavily used after the 2nd level
however, and the structures of the models are clearly
different. GB does not use it at all, while LG splits at 4.5 to
produce the highest prediction level of .899. While GB did
split initially on No_claims 2.5 and then on 4.5, it did not
reach the same level of prediction as LG that started
splitting at 1.5.
By going to the marginal effects plot, we can see that
No_claims has the largest slope for low values, but
member_duration has the highest for highest value of the
variable. But no similar plots can be created for GB.
Thus, found structures and consequent interpretations differ
and there is no isomorphism from one into the other.
Perhaps fractal approximation?
2nd most “important variable”, very different structures.
1st most “important” variable, very different structures.
Great! similar posterior probabilities, different structures but maybe similar
Interpretations? Note more discrepancies when Prob is higher.
....
Ranking the
models
by GOF.
Strongly summarized area for brevity’s sake, and just for completeness.
GOF ranks, TRN (rank per GOF measure; 1 = best):
Model Name                   AUROC  AvgSqErr  ClassRate  CumLift 3rd bin  CumRespRate 3rd  Gini  Precision  R-sq (Cramer/Tjur)  Unw. Mean  Unw. Median
02_M2_TRN_GRAD_BOOSTING        1      1          2             1                1           1       1              1              1.13        1
03_M2_TRN_LOGISTIC_STEPWISE    2      2          1             2                2           2       2              2              1.88        2

GOF ranks, VAL (rank per GOF measure; 1 = best):
Model Name                   AUROC  AvgSqErr  ClassRate  CumLift 3rd bin  CumRespRate 3rd  Gini  Precision  R-sq (Cramer/Tjur)  Unw. Mean  Unw. Median
04_M2_VAL_GRAD_BOOSTING        1      1          2             1                1           1       1              1              1.13        1
05_M2_VAL_LOGISTIC_STEPWISE    2      2          1             2                2           2       2              2              1.88        2
....
Profile and
Model Interpretation
Area.
....
Univariate Profile diagnostics
for 6
Important Vars.
....
Event Proportions
and Posterior Probabilities
for 5
Important Vars.
by original
Model Names.
Variables and probabilities binned for ease of
visualization. Proportion events same across models (it’s
just original data), but probabilities differ across models.
Not all cases shown.
Etc for the other variables.
Some observations
Binned No_claims: While similar in shape, GradBoost seriously
underestimates proportion of events throughout, while logistic has the
problem for bins 2, 3, 5, 6, 7. Logistic has a positive slope, while GB
flattens due to interactive GB model. Up to bin 7, similar behavior for
GB and LG, and then LG jumps to higher level of probability.
Binned Member_Duration: Probability distributions are similar but not
identical. For bins 1, 2, 3 and 16 methods underestimate proportion of
events. Slightly declining slope for both models.
Binned OPTOM_PRESC: Both methods failed to match proportion of
events in the mid range of the bins. Sudden positive upshift in
positive slope for GB starting at bin 15, while overall flat but positive
slope for Logistic.
....
Rescaled Variables
along binned
Posterior
Probability.
Interpretation: In bin 5, No_claims reaches overall max (100), while for bin 1 max is around
35 and 15 in 0-100 scale for respective models. Same interpretation for Q3, etc.
And Conversely ….. (GB = Tree repr. Of Grad_boosting …)
....
Partial Dependency
Plots and variants for
Non Ensemble
Models.
Some variables may be dropped due to computer resources.
Note the narrow range of GB PDPs compared to those of LG, due to GB’s
interactive nature → more difficult to interpret.
Marginal (1) PDP comparative notes
(Marginal (1): one var at a time. Could also marginalize two vars at a time, not
done in this presentation).
GB marginals are rather flat, except for
MEMBER_DURATION, of which a caveat later on.
LG is juicier: NO_CLAIMS’ increase of probability declines
along its range, but OPTOM_PRESC’s increases, which seems to
indicate that the leading reason would be prescriptions and not
overall claims.
Corresponding marginals for logistic end up with a slowing
down of growth due to the logistic shape. GB is not constrained
in that way.
PDP comparative notes
Overall PDP is Model probability when all predictors are at their means.
For LG, it’s about .17, while GB is .53. Individual PDPs (by def.) are
deviations from Overall when var of interest measured along its range,
while others remain at mean values. GB clumps most PDPs around
Overall, LG clearly distinct values instead.
Highest probability level for GB is around .7 while LG reaches 1, and
minima are around .6 and 0 respectively. Note LG monotonicity while
GB is mostly monotonic (except for Doctor_visits), possibly product of
data set created artificially.
In both cases, NO_CLAIMS appears as leading variable, especially in LG
but while Member_duration is rather flat in GB, it certainly declines
steadily in LG, with a very different interpretation. Longer member
duration implies steadier customer and familiarity. No_members had
been excluded in LG’s stepwise and should not be confused with
MEMBER_DURATION.
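The "Overall PDP" quoted above is the model probability with every predictor at its mean. For a logistic model this generally differs from the average predicted probability (Jensen's inequality), which helps explain why the overall level can sit far from the raw event rate. A toy sketch, with invented numbers:

```python
# Probability at the mean of X vs. mean of the probabilities: for a
# nonlinear link the two differ, sometimes substantially.
import math, random

random.seed(2)
xs = [random.gauss(0, 2) for _ in range(5000)]

def p(x):                        # stand-in logistic model (hypothetical coefs)
    return 1.0 / (1.0 + math.exp(-(-1.5 + 1.0 * x)))

p_at_mean = p(sum(xs) / len(xs))            # the "Overall PDP" level
mean_of_p = sum(p(x) for x in xs) / len(xs) # the average prediction
print(round(p_at_mean, 3), round(mean_of_p, 3))
```

With these numbers the probability evaluated at the mean predictor value is noticeably below the average predicted probability, so the two summaries should not be read interchangeably.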
....
Now, mix
All previous
Probability
and PDPs
Together.
Member_duration alone brags too much.
UMI: Univariate Model Interpretation.
From preceding pages, we can conclude that:
No_claims: positively associated with increased fraud,
for both logistic and grad boosting, but with a far steeper slope in
logistic. Grad_b stays in a narrow band of probability and is
more interactive with other predictors → Grad_b requires
more BMI and MMI. GB’s PDP overshoots the posterior
probability → other vars bring down this effect in GB.
Member_duration has a U shape relationship,
especially in Logistic case, while GB has a more spiky
one. Note the high spike at duration minimal and
immediate decline which seems to indicate members that
committed fraud as soon as they joined and left
immediately.
UMI: Univariate Model Interpretation (cont. 1)
PDP view: logistic shows positive effects of NO_CLAIMS
and OPTOM_PRESC, balanced by negative effects of
remaining variables.
Comparing posterior probability with No_claims PDP,
they are almost the same for Logistic. Similarly for
MEMBER_DURATION.
Grad_b instead shows more tepid effects of same
variables, and almost unchanging effects of remaining
predictors. Comparing PDP with probability, other
predictors bring down PDP of No_claims. Similar effects
for MEMBER_DURATION.
....
PDPs for
"Pairs of
Variables"
Note: 3d plots tend to interpolate areas with no data, producing false
expectations of results. Thus, sometimes 2d charts are preferable to
3d plots.
Not all Pairs of variables available due to computer resources.
Same for LG.
Note that correlations of No_claims with other variables are relatively small when
compared to the pairs Member_duration – Doctor_visits and Member_duration –
Optom_presc. How will this translate into PDPs for 2 variables at a time?
M_: 'M2_TRN_LOGISTIC_STEPWISE' 'ORIGINAL' PDP Corr '-0.02542'
BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr ' 0.05073' NO_CLAIMS
DOCTOR_VISITS
Combination of No_claims & Doctor_visits shows high probability at NE corner
and middle section stable high prob. level. Too many charts to show but
necessary for full interpretation.
BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr ' 0.02549' NO_CLAIMS
MEMBER_DURATION
BINNED ORIGINAL PDP M2_TRN_LOGISTIC_STEPWISE Corr ' 0.06580'
NO_CLAIMS OPTOM_PRESC
BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr '-0.10759' MEMBER_DURATION
OPTOM_PRESC
BINNED ORIGINAL PDP M2_TRN_LOGISTIC_STEPWISE Corr '-0.10759' MEMBER_DURATION
OPTOM_PRESC
Some BMI comments
LG: high levels of NO_CLAIMS have high probability at the lowest level
of total_spend, which probably denotes one-time fraud.
Otherwise, even mid levels of NO_CLAIMS associated with high
probability for any level of TOTAL_SPEND. It seems that FRAUD is
not necessarily linked to TOTAL_SPEND alone.
About NO_CLAIMS and MEMBER_DURATION, fraud happens for low
level of duration, after which fraudsters leave.
About the pair Optom_presc and Member_duration for which we have
contrasting Pair PDPs, with corr = -0.10, interpretation is very
different. While Grad_b shows flat probabilities throughout, except in
NE empty corner, logistic shows more extreme NE corner, plus
declining probabilities from NW top.
For 2 dimensional visualization, collapse 3d chart by averaging
levels of variable 2 into those of variable 1 and compare to original
PDP.
Original and collapsed PDPs are derived from posterior model
probabilities.
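The collapsing step described above can be sketched as follows: build a pair PDP grid, average the levels of variable 2 into those of variable 1, and compare against the original one-variable PDP. The stand-in model here is additive by construction (invented coefficients), so the collapsed and original curves share the same shape; deviations between them would flag interaction.

```python
# "Collapsed PDP" sketch: average a two-variable PDP grid over variable 2
# and compare the collapsed curve with the original one-variable PDP.
import math

def model_prob(x1, x2):          # stand-in fitted model, additive on purpose
    s = -0.4 + 0.9 * x1 - 0.6 * x2
    return 1.0 / (1.0 + math.exp(-s))

g1 = [i * 0.5 for i in range(-4, 5)]           # grid for variable 1
g2 = [j * 0.5 for j in range(-4, 5)]           # grid for variable 2

pdp2 = [[model_prob(a, b) for b in g2] for a in g1]   # pair PDP grid
collapsed = [sum(row) / len(row) for row in pdp2]     # average over var 2
pdp1 = [model_prob(a, 0.0) for a in g1]               # var 2 at its mean (0)
print([round(c, 3) for c in collapsed])
print([round(c, 3) for c in pdp1])
```

Both curves rise together here; with an interaction term in the scorer, the collapsed curve would pull away from the one-variable PDP, which is the diagnostic the slides exploit.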
No room for TOTAL_SPEND
Comments for BMI.
In case of NO_CLAIMS, all cases show overlapping of
collapsed and Original, except for OPTOM_PRESC. In GB
case, DURATION brings down probability slightly because
duration is itself strong predictor.
LG shows that the presence of OPTOM_PRESC raises the
posterior probability, not so accentuated in GB. The LG model
could benefit from a NO_CLAIMS and OPTOM_PRESC
interaction, or possibly an overall transformation by way of
obtaining information per month and per number of
members. (LG chart with TOTAL_SPEND omitted for space
brevity.)
MEMBER_DURATION shows overlap with all second
variables, plus declining slope, more evident in LG
models.
Comments for BMI (cont).
It is possible to obtain 3-way and higher PDPs, and also
collapse them, not tried here.
Given overlap between Original and CPDP, UMI effects are
correct so far, except possibly for triplet NO_CLAIMS,
MEMBER_DURATION and OPTOM_PRESC.
P-comp1 mostly fitted by doctor_visits and member duration. # 2 (which fits
residuals from step 1 ) by No_claims and optom_presc, etc.
And discriminating by Fraud levels →
For Illustration: No_claims for fraud = 1 still highly correlated with second
eigenvector.
Member_duration cannot compete with NO_CLAIMS.
Overall View and omitting # 2 and # 3 for brevity sake.
Comments for PCA results.
PC # 1 shows MEMBER_DURATION and DOCTOR_VISITS
grouped together, NO_CLAIMS and TOTAL_SPEND in
another, and remaining in separate groups (separation
can be proven by statistical inference).
For logistic case, note that NO_CLAIMS is first ‘entered’
variable in Stepwise selection (earlier slides), followed by
MEMBER_DURATION and OPTOM_PRESC. Note that
NO_CLAIMS does not have largest correlation with first
component, even when looking at correlations by values
of FRAUD.
And GB also has NO_CLAIMS and MEMBER_DURATION,
not represented in hierarchy of Principal Components
Analysis. PCA does not provide framework for
interpreting models.
Comments for PCA results (cont.)
PCA orthogonalizes away predictors effects when going
from step to step, not done by our present modeling
methods.
Having chosen cutoff point in posterior probability,
possible to obtain similar PCA results for predicted 0 and
predicted 1, obtain correlations and compare to previous
results.
Possible to use statistical inference to determine
equality/inequality of correlations (with original results)
for different cutoff points.
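The inference on equality of correlations mentioned above is commonly done with Fisher's z transformation. A minimal sketch — the sample correlations and sizes below are made up, not results from the presentation's data:

```python
# Fisher z test for equality of two independent correlations, e.g.
# correlations computed at two different probability cutoffs.
import math

def fisher_z_test(r1, n1, r2, n2):
    z1 = math.atanh(r1)                      # Fisher transform of each r
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal p-value
    return z, p

# Hypothetical inputs: r = .45 on 500 obs vs. r = .30 on 450 obs.
z, p = fisher_z_test(0.45, 500, 0.30, 450)
print(round(z, 3), round(p, 4))
```

With these illustrative inputs the difference is statistically significant at the usual 0.05 level; identical correlations give z = 0 by construction.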
Comments for Statistical Inference – Multiple Comparisons.
Bars below ‘0.05’ are considered to be significant, having taken
multiple-comparison effects into consideration.
Most results are insignificant, but the significant results differ
across models: NO_CLAIMS provides the same information
throughout the range of probs for GB, but LG finds the first bin to
be significantly different from the rest. GB results come from
splitting the GB search. Thus, LG states that a lower probability
level indicates NO_FRAUD; GB cannot state that.
Definite monotonic relationship: higher values of NO_CLAIMS (e.g., bin 5)
associated with higher probability. Note differences between LG and GB.
More diffuse relationship.
GB monotonic, LG slightly U-shaped.
Similar coeffs among logistic , GB and beta regressions. By Beta Regression,
standard interpretation of log odds is possible, with caveats.
Vars * Models * Coeffs (Coeff / Importance; blank cells “.” as in source)
Columns: (1) M2_TRN_GRAD_BOOSTING, (2) M2_TRN_GRAD_BOOSTING_BETA_REG,
(3) M2_TRN_LOGISTIC_STEPWISE, (4) M2_TRN_LOGISTIC_STEPWISE_BETA_REG,
(5) M2_VAL_GRAD_BOOSTING, (6) M2_VAL_LOGISTIC_STEPWISE

Variable          (1)      (2)       (3)      (4)             (5)      (6)
MEMBER_DURATION   0.7318   -0.0051   -0.0057  -0.0057         0.7318   -0.0084
DOCTOR_VISITS     0.3925   -0.0061   .        .               0.3925   .
TOTAL_SPEND       0.6610   -0.0000   -0.0000  -0.0000         0.6610   -0.0000
OPTOM_PRESC       0.5944   0.1713    0.2132   0.2132          0.5944   0.1634
NO_CLAIMS         1.0000   0.7027    0.7921   0.7921          1.0000   0.7351
SCALE             .        21.8895   .        2.590291516E16  .        .
INTERCEPT         .        -0.7979   -0.8352  -0.8352         .        -0.3111
Beta Regression results
Results of analyzing posterior probabilities (i.e.,
original GB and LG posteriors) via BETA
regression show very similar coefficients and
structures  Beta is reassuring but not providing
additional information.
Possible to say that …
1) Manipulate just NO_CLAIMS and problem
solved?
2) Maybe add MEMBER_DURATION and
OPTOM_PRESC for parts of NO_CLAIMS
range?
3) Maybe add if-then rules from simplified
TREE_REPRESENTATION because easier than
GB and more interactive than LG?
4) If using a Neural Network and NN derivatives,
abandon all hope of interpretation?
5) → Interpretation needs definition of the INTENDED
AUDIENCE (see Tolstoy ut supra).
Possible to say that … (cont. 1)
1) The analyst needs to focus on NO_CLAIMS,
MEMBER_DURATION and OPTOM_PRESC as
an ‘IMPORTANT’ group.
2) Different model Interpretations should be
entertained.
3) Different marginal effects must be explained.
Final thoughts before I exhaust the audience, if
not exhausted already.
MI analysis can proceed further obtaining
insights from collapsing three way PDPs for
instance.
If an ‘easier’ linear-model explanation is preferred, beta
regression on the posterior probability would provide
regression-like information. Still, beta regression
is not straightforward, and model selection is a big
issue.
Future steps
Focus on MMI
1) Collapsing higher PDP orders, i.e., 3 way variables
and interpreting.
2) Beta regression for ‘linear’ interpretation. More
difficult because it requires model search as well.
Plus, additional error in modeling posterior
probability of original model.
3) Andrews’ curves.
Lime: Local Interpretable Model-Agnostic Explanations:
Uses surrogate interpretable model on black-box model, applied to observations of
interest. Tree representation in this presentation similar to this.
(https://homes.cs.washington.edu/~marcotcr/blog/lime/)
ICE: Clusters or classification variable applied to
PDP results. For given predictor, ICE plots draw one line per obs.,
representing how instance’s prediction changes when predictor
changes.
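A minimal sketch of ICE as described above: one curve per observation, with the PDP as the pointwise average of the curves. The stand-in model below has a deliberate interaction (hypothetical coefficients) so that the individual curves disagree in direction, which is exactly what ICE is meant to reveal and a PDP alone would hide:

```python
# ICE sketch: for each observation, sweep the chosen predictor over a grid
# while keeping that row's other values fixed; one curve per observation.
import math, random

random.seed(9)
rows = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]

def model_prob(x1, x2):          # stand-in fitted model with an interaction
    return 1.0 / (1.0 + math.exp(-(0.8 * x1 + 1.5 * x1 * x2)))

grid = [i * 0.5 for i in range(-4, 5)]
ice = [[model_prob(g, r[1]) for g in grid] for r in rows]   # one curve per row
# the PDP is the pointwise average of the ICE curves
pdp = [sum(curve[k] for curve in ice) / len(ice) for k in range(len(grid))]
print(len(ice), len(pdp))
```

Because of the interaction, some ICE curves rise with the predictor while others fall; averaging them into a PDP washes that heterogeneity out.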
Shap Values: Shapley Additive Explanations (Lundberg et
al., 2017): measure positive or negative feature contributions to the posterior
probability; the technique comes from game theory, where it determines each
player’s contribution to the success of a game. Affected by correlations among
predictors → focusing on just one predictor to change behavior may
change other predictors as well (available in Python).
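The Shapley idea can be sketched exactly on a toy 3-feature scorer by averaging each feature's marginal contribution over all orderings. Here "absent" features are set to a fixed baseline for simplicity, whereas real SHAP implementations average over the data distribution; the model and numbers are purely illustrative:

```python
# Exact Shapley values on a toy scorer: average each feature's marginal
# contribution over all permutations of feature arrival orders.
import itertools

BASELINE = [0.0, 0.0, 0.0]       # stand-in for "feature absent"

def score(x):                    # stand-in model with an interaction term
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]

def shapley(x):
    n = len(x)
    phi = [0.0] * n
    perms = list(itertools.permutations(range(n)))
    for perm in perms:
        cur = BASELINE[:]
        prev = score(cur)
        for j in perm:           # add features one at a time, credit the gain
            cur[j] = x[j]
            now = score(cur)
            phi[j] += now - prev
            prev = now
    return [v / len(perms) for v in phi]

x = [1.0, 1.0, 1.0]
phi = shapley(x)
print([round(v, 3) for v in phi])
# Contributions sum to score(x) - score(BASELINE): the efficiency property.
```

The interaction term's credit is split evenly between the two features involved, which is the symmetry property that makes Shapley attributions attractive, and also why correlated predictors share (and blur) credit.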
AND OTHERS ….
References
Lundberg, S.M., Lee, S.I. (2017): “Consistent feature
attribution for tree ensembles”, presented at the 2017
ICML Workshop on Human Interpretability in Machine
Learning (WHI 2017), Sydney, NSW, Australia.
https://arxiv.org/abs/1706.06060
Molnar, C. (2018): Interpretable Machine Learning. A
Guide for Making Black Box Models Explainable.
https://christophm.github.io/interpretable-ml-book/
Tolstoy, L. (1894): “The Kingdom of God Is Within You”.
4_5_Model Interpretation and diagnostics part 4.pdf

  • 1.
  • 2. Areas to improve: the principal components main correlation part is difficult to read; improve on this. Explain how we got the plot of overall probability against a single variable.
  • 3. Abstract. Statistical and data science models are considered, somewhat pejoratively, to be black boxes whose interpretation has not been systematically studied. Molnar's "Interpretable Machine Learning" is a big effort at finding solutions. Our presentation is humbler: we aim to present visual tools for model interpretation based on partial dependency plots and their variants, such as the collapsed PDPs created by the presenter, some of which may be polemical and debatable. The audience should be versed in model creation and have at least some insight into partial dependency plots. The presentation is based on a simple working example with 6 predictors and one binary target variable, for ease of exposition. It is not possible to detail exhaustively every method described in this presentation; an extensive document is in preparation. Slides marked **** can be skipped for an easier first reading.
  • 4.
  • 5. Overall comments and introduction. Presentation by way of example, focusing on the fraud/default data sets and continuing previous chapters. Aim: to study interpretation and diagnosis, mostly via partial dependency plots, of logistic regression, classification trees and regression boosting. At present there are many written opinions and distinctions about the topic; no time to discuss them all, so see Molnar's (2018) book for an overall view. No discussion of imbalanced-data-set modeling or other modeling issues, and no discussion of the literature, all due to time constraints.
  • 6. Interpretation and explanation: from whom to whom? *** Models typically involve multivariate relationships, usually displayed, summarized and/or measured by specialized statistics, either graphically or as tables/formulae. Graphical representations are easier to "interpret" and "understand", but not necessarily fully interpretable. In addition, the possible existence of subgroups in the data implies more interpretations and the need for inter-group comparisons that can test the patience of the "to whom" audience. Finally, models are created by software, not "by hand". Software does not explain or interpret; it "fits", following computational or algebraic algorithms. Thus, don't blame the software for poor interpretations. More importantly, EU regulations at present require explaining models under the GDPR "right to explanation" (General Data Protection Regulation).
  • 7. Interpretation in the context of pre-conceptions: "it makes sense". *** In the Middle Ages the world was deemed to be flat, and it "made sense". Pre-conceptions and lack of analytical insight can seriously undermine, if not mislead, model interpretation. "Making sense" and rules of thumb, usually based on univariate or at most bivariate (sometimes causal) relationships (or just convenience), bias understanding, model creation and selection. Model interpretation should not be understood as the "dumbing down" of complex multivariate relationships, but as a clear exposition of the conditions that lead to an event, however difficult it may be to disentangle multivariate conditions. On the other hand, the modeler should not take refuge in arcane formulae to hide his/her own superficial understanding of the conditions unveiled by the model; otherwise, the action is called deceit.
  • 8. “The most difficult subjects can be explained to the most slow-witted man if he has not formed any idea of them already; but the simplest thing cannot be made clear to the most intelligent man if he is firmly persuaded that he knows already, without a shadow of doubt, what is laid before him.” Leo Tolstoy, “The Kingdom of God Is Within You” (1894)
  • 9. Interpretation in the context of many competing models. *** It is desirable that ONE and just ONE interpretation be the final outcome of a model search. But just as in criminal detection there may be many suspects with different or similar motivations, different models may sometimes be interpreted similarly, though often interpretations are vastly different. Should model interpretation determine or condition model selection? In practice, model creation and selection come prior to model interpretation, assuming competent model creation and interpretation. Personal preference: if a model is non-interpretable, it is NOT A GOOD MODEL. Classical statisticians, however, neglected variable selection and the ensuing model uncertainty.
  • 10.
  • 11.
  • 12.
  • 13. Model interpretation categorization. Just as in EDA (but on model results, not on the initial data), three types: Univariate Model Interpretation (UMI): one variable at a time. EASIEST to understand and a huge source of "makes sense". E.g., classical linear model interpretations; e.g., reasons to decline a loan. Bivariate Model Interpretation (BMI): looking at pairs of variables to interpret model results. Multivariate Model Interpretation (MMI): overall model interpretation, the most difficult. Typically, most work results in UMI and perhaps BMI.
  • 14.
  • 15. Days of linear regression interpretation. *** Based on the "ceteris paribus" assumption, which fails in the case of even relatively small VIFs. At present, the rule of thumb is VIF >= 10 (R-sq = .90 among predictors) implies an unstable model. The "ceteris paribus" exercise: keeping all other predictors constant, an increase in ... But if the R-sq among predictors is even 10%, it is not possible to keep all predictors constant while increasing the variable of interest by 1. Advantage: EASY to conceptualize, because the practice follows the notion of bivariate correlation. But that notion is generally wrong in the multivariate case.
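The VIF rule of thumb above can be checked numerically. A minimal sketch with simulated, deliberately collinear predictors (the data and column roles are made up for illustration): VIF_j = 1/(1 - R²_j), where R²_j comes from regressing column j on the remaining columns.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing
    column j of X on the other columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)                      # independent predictor
X = np.column_stack([x1, x2, x3])
print(vif(X))   # x1 and x2 blow past the VIF >= 10 threshold; x3 stays near 1
```

With the collinear pair, the "keep all other predictors constant" exercise is exactly what the data cannot do.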
  • 16. In simple regression the estimated slope is
    \hat{\beta} = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2} = r_{xy}\,\frac{s_Y}{s_X}, \qquad sg(\hat{\beta}) = sg(r_{xy}).
    Corr(X,Y) = \hat{\beta} if SD(Y) = SD(X), e.g., if both are standardized; otherwise they at least share the same sign, and the interpretation from the correlation holds in the simple regression case. Notice that the regression of X on Y is NOT the inverse of the regression of Y on X, because of SD(X) and SD(Y), which causes confusion about the signs of coefficients and their interpretation.
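The slope/correlation identity on this slide is easy to verify numerically; a small sketch with simulated data (values and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = -2.0 * x + rng.normal(size=1000)

r_xy = np.corrcoef(x, y)[0, 1]
beta = np.polyfit(x, y, 1)[0]            # OLS slope of y on x
# beta_hat = r_xy * s_Y / s_X, so sign(beta_hat) = sign(r_xy)
assert np.isclose(beta, r_xy * y.std() / x.std())

# The reverse regression (x on y) is r_xy * s_X / s_Y, NOT 1/beta:
beta_rev = np.polyfit(y, x, 1)[0]
assert np.isclose(beta_rev, r_xy * x.std() / y.std())
```

The second assertion is the slide's point that regressing X on Y is not the inverse of regressing Y on X.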
  • 17. 2019-05-10. In multiple linear regression the previous relationship does not hold, because predictors can be correlated (r_{XZ}), weighted by r_{YZ}, hinting at collinearity and/or relationships of suppression/enhancement. In the multivariate case, e.g. Y = \alpha + \beta X + \gamma Z + \varepsilon, the estimated equation (emphasizing the "partial" coefficient) gives, for example,
    \hat{\beta}_{YX.Z} = \frac{s_Y}{s_X}\cdot\frac{r_{YX} - r_{YZ} r_{XZ}}{1 - r_{XZ}^2},
    so sg(\hat{\beta}_{YX.Z}) = sg(r_{YX} - r_{YZ} r_{XZ}), which need not equal sg(r_{YX}).
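The partial-coefficient formula, and the sign reversal it allows, can be demonstrated on simulated data; the coefficients below are chosen (illustratively) so that the marginal correlation of Y with X is positive while the partial coefficient is negative:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=2000)
x = 0.9 * z + 0.4 * rng.normal(size=2000)      # X strongly correlated with Z
y = -0.5 * x + 2.0 * z + rng.normal(size=2000) # true partial effect of X is negative

r = np.corrcoef(np.vstack([y, x, z]))
r_yx, r_yz, r_xz = r[0, 1], r[0, 2], r[1, 2]

# beta_YX.Z = (s_Y/s_X) * (r_YX - r_YZ*r_XZ) / (1 - r_XZ^2)
formula = (y.std() / x.std()) * (r_yx - r_yz * r_xz) / (1 - r_xz**2)

A = np.column_stack([np.ones_like(x), x, z])
beta = np.linalg.lstsq(A, y, rcond=None)[0][1]  # OLS coefficient on X given Z
assert np.isclose(beta, formula)

# Sign reversal: marginally Y and X move together, partially they do not.
assert r_yx > 0 and beta < 0
```

This is exactly the suppression/enhancement situation the slide hints at: sg(r_YX - r_YZ r_XZ) can differ from sg(r_YX).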
  • 18. Comment: even in traditional UMI land, we find that the multivariate relations given by partial and semi-partial correlations must be part of the interpretation. Note that while correlation is a bivariate relationship, partial and semi-partial correlations can be extended to the multivariate setting. However, even BMI, and certainly MMI, are not so often performed.
  • 19.
  • 20. Searching for important variables en route to answering the modeling question. Case study: the minimum components needed to make a car go along a highway: 1) engine, 2) tires, 3) steering wheel, 4) transmission, 5) gas, 6) ... plus other MMI aspects and interrelations. Take just one of them out, and the car won't drive. There is no SINGLE most important variable, but a minimum irreducible set of them. In the data science case with n → ∞, there are possibly many subsets of "important" variables. But "suspect VARS" are a good starting point for the research.
  • 21.
  • 22.
  • 23. Model M2: data summary.
    TRN data set: train, 3,595 obs, 20.389% events
    VAL data set: validata, 2,365 obs, 19.281% events
    TST data set: none, 0 obs
    Dependent variable: fraud
  • 24. Original variables and labels (Model M2):
    DOCTOR_VISITS: Total visits to a doctor
    MEMBER_DURATION: Membership duration
    NO_CLAIMS: No. of claims made recently
    NUM_MEMBERS: Number of members covered
    OPTOM_PRESC: Number of opticals claimed
    TOTAL_SPEND: Total spent on opticals
  • 25. Requested models: names & descriptions.
    M2: 20 pct prior (overall model)
    M2_BY_DEPVAR: Inference
    01_M2_GB_TRN_TREES: Tree representation for Gradient Boosting
    02_M2_TRN_GRAD_BOOSTING: Gradient Boosting
    03_M2_TRN_LOGISTIC_STEPWISE: Logistic TRN stepwise
    04_M2_VAL_GRAD_BOOSTING: Gradient Boosting
    05_M2_VAL_LOGISTIC_STEPWISE: Logistic VAL stepwise
  • 26. Data set: definition by way of example. Health insurance company: ophthalmologic insurance claims. Is a claim valid or fraudulent? Binary target. No transformations created, to keep the data set simple. Full description and analysis of this data set at https://www.slideshare.net/LeonardoAuslender (lectures at Principal Analytics Prep).
  • 27. Alphabetic list of variables and attributes (all numeric, length 8):
    1 FRAUD: Fraudulent activity yes/no
    2 TOTAL_SPEND: Total spent on opticals
    3 DOCTOR_VISITS: Total visits to a doctor
    4 NO_CLAIMS: No. of claims made recently
    5 MEMBER_DURATION: Membership duration
    6 OPTOM_PRESC: Number of opticals claimed
    7 NUM_MEMBERS: Number of members covered
    Note: no nominal predictors. No transformations, to keep the presentation simple, but not simpler than necessary.
  • 28. .... Reporting area for all models' coefficients, importance, etc., and selected variables.
  • 29. Variables * models: GB importance (with number of rules; identical on TRN and VAL) and logistic coefficient with p-value (TRN / VAL):
    NUM_MEMBERS      GB 0.1099 (2 rules);  LG: not selected
    OPTOM_PRESC      GB 0.6211 (19 rules); LG 0.2178, p=.000 / 0.1463, p=.000
    DOCTOR_VISITS    GB 0.4434 (20 rules); LG -0.0171, p=.020 / -0.0065, p=.428
    MEMBER_DURATION  GB 0.7843 (41 rules); LG -0.0066, p=.000 / -0.0065, p=.000
    TOTAL_SPEND      GB 0.6864 (29 rules); LG -0.0000, p=.003 / -0.0000, p=.004
    NO_CLAIMS        GB 1.0000 (19 rules); LG 0.7752, p=.000 / 0.7610, p=.000
    INTERCEPT        LG -0.5767, p=.000 / -0.5635, p=.001
  • 30. Logistic selection steps (M2_TRN_LOGISTIC_STEPWISE):
    Step 1: no_claims entered (p = .00), 1 in model
    Step 2: member_duration entered (p = .00), 2 in model
    Step 3: optom_presc entered (p = .00), 3 in model
    Step 4: total_spend entered (p = .00), 4 in model
    Step 5: doctor_visits entered (p = .02), 5 in model
    NUM_MEMBERS was dropped.
  • 31.
  • 32. Marginal effect: change in probability as X changes.
  • 33.
  • 34. Some conclusions and comments so far: Logistic stepwise dropped NUM_MEMBERS, which is shown with the lowest relative importance in GB. Notice that logistic regression does not have an agreed-upon scale of importance; we can use odds ratios, e.g. NO_CLAIMS is deemed the most important single variable for GB, but logistic deems OPTOM_PRESC the second one (via odds ratios), while GB selected MEMBER_DURATION. The remaining variables have odds ratios of 1, which seems to indicate similar effects, while GB distinguishes relative importance after the first two variables.
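The odds-ratio comparison above is just exp(coefficient). A short sketch using the reported TRN stepwise coefficients (copied from the coefficients slide; note that odds ratios are per one unit of each predictor and hence scale-dependent):

```python
import math

# Reported TRN logistic stepwise coefficients
coefs = {
    "NO_CLAIMS": 0.7752,
    "OPTOM_PRESC": 0.2178,
    "DOCTOR_VISITS": -0.0171,
    "MEMBER_DURATION": -0.0066,
    "TOTAL_SPEND": -0.0000,
}
odds_ratios = {v: math.exp(b) for v, b in coefs.items()}

# Order by distance from OR = 1 (i.e., |coefficient|)
for v, orat in sorted(odds_ratios.items(), key=lambda kv: -abs(math.log(kv[1]))):
    print(f"{v:16s} OR = {orat:.4f}")
```

NO_CLAIMS and OPTOM_PRESC stand out; the remaining odds ratios sit essentially at 1, matching the slide's reading.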
  • 35. LG does not have a measure of importance as GB does, so we use marginal-effects plots, which indicate the change in probability along each variable's range. Except for MEMBER_DURATION (which declines initially), the other effects are positive with different intensities, and the maximum value declines as per the logistic shape. MEMBER_DURATION shows a pronounced decline at low duration levels, suggesting the possibility of fraudulent members who join, commit their fraud and leave. Note the sharper increase in probability for NO_CLAIMS at bins 1 and 8, and for OPTOM_PRESC at 6. GB importance measures the impact of individual inputs on predicting Y, but does not tell how that impact changes along the range of the inputs, and individual variable effects are not in consideration; so we use partial dependency plots, which come as a free ride for LG as well.
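The shape described above ("max value declines as per the logistic shape") follows from the logistic marginal effect dP/dx = beta * p * (1 - p). A sketch along one predictor's range, holding the linear predictor otherwise fixed; the two coefficients are illustrative round-offs of the reported intercept and NO_CLAIMS coefficient, not a refit:

```python
import numpy as np

beta0, beta1 = -0.58, 0.78          # rounded from the reported logistic fit
x_grid = np.linspace(0, 10, 11)     # grid over the predictor's range

# Probability along the grid, then the marginal effect dP/dx = beta1 * p * (1-p)
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x_grid)))
marginal = beta1 * p * (1.0 - p)

# The effect peaks where p is near 0.5 and shrinks toward both tails.
print(np.round(marginal, 4))
```

This is why a positive-coefficient variable shows a positive but fading effect at high values of X.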
  • 36. Marginal effects and PDPs. Marginal effects refer to the change in probability for a one-unit change in X, ceteris paribus (if meaningful, or at least desirable). PDPs do not indicate change in Y at all; instead, a PDP measures probability levels at different values of X1, with all other predictors measured at their means (or modes, medians, etc.). So there is no marginality in PDPs, unless we also measure the "change" in probability, shown later on as marginal PDPs.
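A minimal PDP sketch, not the presenter's exact procedure: the Friedman-style construction clamps predictor j at each grid value for every row and averages the predictions; the cheaper variant described on the slide instead fixes the other predictors at their means. The stand-in model and its weights below are illustrative, not the presentation's fitted model:

```python
import numpy as np

def predict_proba(X):
    # Stand-in model: a logistic in two predictors (illustrative weights).
    eta = -0.5 + 0.8 * X[:, 0] - 0.01 * X[:, 1]
    return 1.0 / (1.0 + np.exp(-eta))

def partial_dependence(X, j, grid):
    """Clamp column j at each grid value for all rows; average predictions."""
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v
        pdp.append(predict_proba(Xv).mean())
    return np.array(pdp)

rng = np.random.default_rng(3)
X = np.column_stack([rng.poisson(1.0, 1000).astype(float),   # claims-like count
                     rng.uniform(0, 300, 1000)])             # duration-like scale
grid = np.arange(0, 6)
print(np.round(partial_dependence(X, 0, grid), 3))
```

Note that the output is a probability level at each grid point, not a change in probability, which is the distinction the slide draws against marginal effects.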
  • 37.
  • 39.
  • 40.
  • 41. Tree representation(s) up to 4 levels, model M2_GB_TRN_TREES (rule, with prediction in parentheses):
    no_claims < 2.5 (0.185)
      no_claims < 0.5 (0.159)
        member_duration < 180.5 (0.199)
          total_spend < 5250 (0.464)
          total_spend >= 5250 (0.186)
        member_duration >= 180.5 (0.103)
          doctor_visits >= 5.5 (0.093)
          doctor_visits < 5.5 (0.126)
      no_claims >= 0.5 (0.321)
        optom_presc < 3.5 (0.291)
          total_spend >= 6300 (0.273)
          total_spend < 6300 (0.467)
        optom_presc >= 3.5 (0.59)
          member_duration < 154.5 (0.67)
          member_duration >= 154.5 (0.447)
    no_claims >= 2.5 (0.633)
      no_claims < 4.5 (0.57)
        optom_presc < 3.5 (0.54)
          member_duration >= 128.5 (0.498)
          member_duration < 128.5 (0.627)
        optom_presc >= 3.5 (0.81)
          member_duration >= 137 (0.785)
          member_duration < 137 (0.85)
      no_claims >= 4.5 (0.761)
        member_duration < 303.5 (0.778)
          member_duration >= 148 (0.757)
          member_duration < 148 (0.823)
    Missing one line.
  • 42. Tree representation(s) up to 4 levels, model M2_LG_TRN_TREES (rule, with prediction in parentheses):
    no_claims < 1.5 (0.164)
      member_duration < 155.5 (0.235)
        optom_presc < 3.5 (0.213)
          no_claims < 0.5 (0.195)
          no_claims >= 0.5 (0.337)
        optom_presc >= 3.5 (0.49)
          optom_presc < 6.5 (0.404)
          optom_presc >= 6.5 (0.647)
      member_duration >= 155.5 (0.111)
        optom_presc < 3.5 (0.103)
          member_duration >= 246.5 (0.065)
          member_duration < 246.5 (0.122)
        optom_presc >= 3.5 (0.235)
          no_claims >= 0.5 (0.353)
          no_claims < 0.5 (0.213)
    no_claims >= 1.5 (0.61)
      no_claims < 2.5 (0.451)
        member_duration < 155.5 (0.562)
          optom_presc >= 1.5 (0.651)
          optom_presc < 1.5 (0.493)
        member_duration >= 155.5 (0.353)
          member_duration >= 237 (0.204)
          member_duration < 237 (0.39)
      no_claims >= 2.5 (0.748)
        no_claims < 4.5 (0.675)
          member_duration >= 236.5 (0.477)
          member_duration < 236.5 (0.721)
        no_claims >= 4.5 (0.899)
          member_duration >= 272 (0.741)
    Missing one line.
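The nested rule-plus-probability text on these two slides can be produced by a small recursive renderer; a sketch below hard-codes only the top of the reported GB tree (values taken from the GB tree slide) as the data structure, since the full fitted trees are not available here:

```python
def render(node, depth=0):
    """Render a nested rule tree as indented 'rule (prob)' lines."""
    lines = []
    if "rule" in node:
        lines.append("  " * depth + f"{node['rule']} ({node['prob']})")
    for child in node.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines

# Top two levels of the reported GB tree (from the slide).
gb_top = {"children": [
    {"rule": "no_claims < 2.5", "prob": 0.185, "children": [
        {"rule": "no_claims < 0.5", "prob": 0.159},
        {"rule": "no_claims >= 0.5", "prob": 0.321}]},
    {"rule": "no_claims >= 2.5", "prob": 0.633, "children": [
        {"rule": "no_claims < 4.5", "prob": 0.57},
        {"rule": "no_claims >= 4.5", "prob": 0.761}]},
]}
print("\n".join(render(gb_top)))
```

The same renderer applies to the LG tree, which makes side-by-side structural comparison of the two models straightforward.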
• 43. Comment. LG starts by splitting at NO_CLAIMS >= 2 as likely fraud, while GB splits at >= 3. First-level predictions across the two models are similar, (.185, .633) for GB vs. (.164, .61) for LG, which indicates that the structures identified so far are similar. At the second level, GB splits only on NO_CLAIMS, while LG splits on MEMBER_DURATION for the suspected non-fraudsters of the first stage and on NO_CLAIMS for the fraud suspects. Predictions are similar only for the 4th node of level 2 (.748 and .761) and differ otherwise. The careful reader may verify that these two predictions emerge by splitting on NO_CLAIMS, albeit at different values, which supports the notion of NO_CLAIMS being the leading clue in our research.
• 44. NO_CLAIMS is not heavily used after the 2nd level, however, and the structures of the two models are clearly different: GB does not use it at all, while LG splits it at 4.5 to produce its highest prediction level, .899. Although GB split initially on NO_CLAIMS at 2.5 and then at 4.5, it did not reach the prediction level of LG, which started splitting at 1.5. The marginal-effects plot shows that NO_CLAIMS has the largest slope at low values, while MEMBER_DURATION has the largest slope at the high end of its range; no comparable plots can be created for GB. Thus the structures found, and the consequent interpretations, differ, and there is no isomorphism from one into the other. Perhaps a fractal approximation?
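The depth-4 GB tree of slide 41 can be read as a set of if-then rules. A minimal sketch, with split and leaf values taken from that table (the branch the slide notes as missing returns None):

```python
def gb_tree_pred(no_claims, member_duration, doctor_visits, total_spend, optom_presc):
    """Leaf probabilities transcribed from the slide-41 tree representation of GB."""
    if no_claims < 2.5:
        if no_claims < 0.5:
            if member_duration < 180.5:
                return 0.464 if total_spend < 5250 else 0.186
            return 0.093 if doctor_visits >= 5.5 else 0.126
        if optom_presc < 3.5:
            return 0.273 if total_spend >= 6300 else 0.467
        return 0.670 if member_duration < 154.5 else 0.447
    if no_claims < 4.5:
        if optom_presc < 3.5:
            return 0.498 if member_duration >= 128.5 else 0.627
        return 0.785 if member_duration >= 137 else 0.850
    if member_duration < 303.5:
        return 0.757 if member_duration >= 148 else 0.823
    return None  # this branch is the line noted as missing in the source table
```

For example, a member with 5 claims and duration 200 falls in the 0.757 leaf, while a member with no claims, short duration and low spend falls in the 0.464 leaf.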
  • 45. 2nd most “important variable”, very different structures.
  • 46. 1st most “important” variable, very different structures.
  • 47.
• 48. Great! Similar posterior probabilities, different structures, but maybe similar interpretations? Note more discrepancies when Prob is higher.
• 49. .... Ranking the models by GOF. Area strongly summarized for brevity's sake, and included just for completeness.
• 50. GOF ranks (rank on each GOF measure; Unw. = unweighted):

Model Name                    AUROC  AvgSqErr  ClassRate  CumLift3rdBin  CumRespRate3rd  Gini  PrecisionRate  Rsq(Cramer-Tjur)  Unw.Mean  Unw.Median
02_M2_TRN_GRAD_BOOSTING         1       1         2            1              1           1         1               1             1.13        1
03_M2_TRN_LOGISTIC_STEPWISE     2       2         1            2              2           2         2               2             1.88        2
04_M2_VAL_GRAD_BOOSTING         1       1         2            1              1           1         1               1             1.13        1
05_M2_VAL_LOGISTIC_STEPWISE     2       2         1            2              2           2         2               2             1.88        2
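Several of the table's GOF measures can be computed directly from labels and posterior probabilities. A sketch in pure Python (the data here are toy stand-ins, not the presentation's dataset):

```python
def auroc(y, p):
    """Area under the ROC curve via the probability-of-correct-ranking identity."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def gini(y, p):
    return 2 * auroc(y, p) - 1  # Gini is a rescaling of AUROC

def avg_square_error(y, p):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)  # Brier score

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auroc(y, p), gini(y, p))  # 0.75 0.5
```

This makes the Gini and AUROC ranks in the table necessarily identical, which is indeed what the table shows.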
  • 51. .... Profile and Model Interpretation Area. .... Univariate Profile diagnostics for 6 Important Vars.
• 52. .... Event Proportions and Posterior Probabilities for 5 Important Vars., by original Model Names. Variables and probabilities binned for ease of visualization. The proportion of events is the same across models (it is just the original data), but probabilities differ across models. Not all cases shown.
  • 53.
  • 54.
  • 55. Etc for the other variables.
• 56. Some observations. Binned NO_CLAIMS: while similar in shape, GradBoost seriously underestimates the proportion of events throughout, while logistic has the problem for bins 2, 3, 5, 6, 7. Logistic has a positive slope, while GB flattens, owing to GB's interactive model. Up to bin 7 GB and LG behave similarly, and then LG jumps to a higher probability level. Binned MEMBER_DURATION: the probability distributions are similar but not identical. For bins 1, 2, 3 and 16 both methods underestimate the proportion of events. Slightly declining slope for both models. Binned OPTOM_PRESC: both methods fail to match the proportion of events in the mid range of the bins. Sudden upshift in positive slope for GB starting at bin 15, while the slope is overall flat but positive for Logistic.
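The binned comparisons above can be reproduced with simple equal-width binning of a predictor and the event proportion per bin. A sketch under toy data (the variable values are hypothetical):

```python
def binned_event_proportion(x, y, nbins):
    """Equal-width bins of x; proportion of events (y == 1) per bin."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / nbins or 1.0
    counts = [0] * nbins
    events = [0] * nbins
    for xi, yi in zip(x, y):
        b = min(int((xi - lo) / width), nbins - 1)  # clamp max(x) into last bin
        counts[b] += 1
        events[b] += yi
    return [e / c if c else None for e, c in zip(events, counts)]

# toy NO_CLAIMS-like variable and fraud indicator
x = [0, 0, 1, 1, 2, 2, 3, 3]
y = [0, 0, 0, 1, 1, 1, 1, 1]
print(binned_event_proportion(x, y, 4))  # [0.0, 0.5, 1.0, 1.0]
```

The same binning applied to each model's posterior probabilities, instead of `y`, gives the comparison curves discussed above.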
• 58. Interpretation: in bin 5, NO_CLAIMS reaches its overall max (100), while for bin 1 the max is around 35 and 15 on the 0-100 scale for the respective models. Same interpretation for Q3, etc.
  • 59.
  • 60.
  • 61.
  • 62. And Conversely ….. (GB = Tree repr. Of Grad_boosting …)
  • 63.
  • 64. .... Partial Dependency Plots and variants for Non Ensemble Models. Some variables may be dropped due to computer resources.
• 65. Note the narrow range of GB PDPs compared to those of LG, due to GB's interactive nature → more difficult to interpret.
  • 66.
• 67. Marginal (1) PDP comparative notes (Marginal (1): one variable at a time; one could also marginalize two variables at a time, not done in this presentation). GB marginals are rather flat, except for MEMBER_DURATION, about which a caveat later on. LG is juicier: the probability increase from NO_CLAIMS declines along its range, but that from OPTOM_PRESC increases, which seems to indicate that the leading reason would be prescriptions and not overall claims. The corresponding marginals for logistic end up with growth slowing down due to the logistic shape; GB is not constrained in that way.
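The "growth slowing down" for logistic follows from the marginal-effect formula dP/dx_j = beta_j * p * (1 - p), which shrinks as p approaches 0 or 1. A minimal sketch (the coefficient value is hypothetical):

```python
import math

def logistic_marginal_effect(beta_j, eta):
    """Slope of the event probability w.r.t. predictor j, at linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return beta_j * p * (1.0 - p)

beta = 0.79  # hypothetical NO_CLAIMS-style coefficient
print(logistic_marginal_effect(beta, 0.0))  # largest slope, at p = 0.5
print(logistic_marginal_effect(beta, 4.0))  # slope flattens as p -> 1
```

This is why the LG marginals must bend toward flatness at the extremes of the probability range, regardless of the coefficient's size.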
• 68. PDP comparative notes. The Overall PDP is the model probability when all predictors are at their means: for LG it is about .17, while for GB it is .53. Individual PDPs are (by definition) deviations from the Overall value as the variable of interest moves along its range while the others remain at their mean values. GB clumps most PDPs around the Overall value; LG shows clearly distinct values instead. The highest probability level for GB is around .7 while LG reaches 1; the minima are around .6 and 0 respectively. Note LG's monotonicity, while GB is mostly monotonic (except for DOCTOR_VISITS), possibly a product of the artificially created data set. In both cases NO_CLAIMS appears as the leading variable, especially in LG; and while MEMBER_DURATION is rather flat in GB, it certainly declines steadily in LG, with a very different interpretation: longer member duration implies a steadier customer and familiarity. NO_MEMBERS had been excluded in LG's stepwise selection and should not be confused with MEMBER_DURATION.
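The slide's construction (other predictors frozen at their means) and the classic Friedman PDP (average over all rows) can both be sketched. The model and data below are hypothetical stand-ins, not the presentation's fitted models:

```python
import math

def predict(row):
    """Hypothetical stand-in for a fitted model's posterior probability."""
    eta = -1.5 + 0.8 * row["no_claims"] - 0.005 * row["member_duration"]
    return 1.0 / (1.0 + math.exp(-eta))

def pdp(f, data, var, grid):
    """Friedman PDP: average prediction with var forced to each grid value."""
    return [sum(f({**row, var: g}) for row in data) / len(data) for g in grid]

def profile_at_means(f, data, var, grid):
    """Slide's variant: all other predictors frozen at their sample means."""
    means = {k: sum(r[k] for r in data) / len(data) for k in data[0]}
    return [f({**means, var: g}) for g in grid]

data = [{"no_claims": c, "member_duration": d}
        for c, d in [(0, 100), (1, 250), (3, 80), (5, 300)]]
print(pdp(predict, data, "no_claims", [0, 2, 4]))
print(profile_at_means(predict, data, "no_claims", [0, 2, 4]))
```

For an additive model like this stand-in the two constructions agree up to a shift; for an interactive model like GB they can differ, which is part of why GB's curves clump.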
  • 70.
  • 72.
• 73. UMI: Univariate Model Interpretation. From the preceding pages we can conclude that: NO_CLAIMS is positively associated with increased fraud for both logistic and Grad_b, but with a far steeper slope in Logistic. Grad_b stays in a narrow band of probability and is more interactive with the other predictors → Grad_b requires more BMI and MMI. GB's PDP overshoots the posterior probability → other variables bring down this effect in GB. MEMBER_DURATION has a U-shaped relationship, especially in the Logistic case, while GB has a spikier one. Note the high spike at minimal duration and the immediate decline, which seems to indicate members who committed fraud as soon as they joined and left immediately.
• 74. UMI: Univariate Model Interpretation (cont. 1). PDP view: logistic shows positive effects of NO_CLAIMS and OPTOM_PRESC, balanced by negative effects of the remaining variables. Comparing the posterior probability with the NO_CLAIMS PDP, they are almost the same for Logistic; similarly for MEMBER_DURATION. Grad_b instead shows more tepid effects of the same variables, and almost unchanging effects of the remaining predictors. Comparing PDP with probability, the other predictors bring down the PDP of NO_CLAIMS; similar effects for MEMBER_DURATION.
• 75. .... PDPs for "Pairs of Variables". Note: 3-d plots tend to interpolate areas with no data, producing false expectations of results; thus binned 2-d charts are sometimes preferable to 3-d plots. Not all pairs of variables are available, due to computer resources.
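A pair PDP is the same averaging with two variables forced jointly onto a grid. A sketch with hypothetical stand-ins (the interaction term mimics GB's interactive behavior; none of these coefficients come from the presentation's models):

```python
import math

def predict(row):
    # hypothetical stand-in model with an interaction term
    eta = (-1.5 + 0.6 * row["no_claims"] + 0.2 * row["optom_presc"]
           + 0.1 * row["no_claims"] * row["optom_presc"])
    return 1.0 / (1.0 + math.exp(-eta))

def pdp2(f, data, v1, v2, grid1, grid2):
    """Grid of average predictions with (v1, v2) forced jointly."""
    return [[sum(f({**r, v1: a, v2: b}) for r in data) / len(data)
             for b in grid2] for a in grid1]

data = [{"no_claims": c, "optom_presc": o}
        for c, o in [(0, 0), (1, 2), (3, 1), (5, 4)]]
grid = pdp2(predict, data, "no_claims", "optom_presc", [0, 2, 4], [0, 2, 4])
print(grid[2][2])  # "NE corner": high values of both variables
```

Cells of the grid where no data actually fall are exactly the regions a 3-d surface would silently interpolate, hence the caution above.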
• 76. Same for LG. Note that the correlations of NO_CLAIMS with the other variables are relatively small compared to the pairs MEMBER_DURATION – DOCTOR_VISITS and MEMBER_DURATION – OPTOM_PRESC. How will this translate into PDPs for two variables at a time?
  • 78. BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr ' 0.05073' NO_CLAIMS DOCTOR_VISITS Combination of No_claims & Doctor_visits shows high probability at NE corner and middle section stable high prob. level. Too many charts to show but necessary for full interpretation.
  • 79. BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr ' 0.02549' NO_CLAIMS MEMBER_DURATION
  • 80. BINNED ORIGINAL PDP M2_TRN_LOGISTIC_STEPWISE Corr ' 0.06580' NO_CLAIMS OPTOM_PRESC
  • 81. BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr '-0.10759' MEMBER_DURATION OPTOM_PRESC
  • 82. BINNED ORIGINAL PDP M2_TRN_LOGISTIC_STEPWISE Corr '-0.10759' MEMBER_DURATION OPTOM_PRESC
• 83. Some BMI comments (LG). High levels of NO_CLAIMS show high probability at the lowest level of TOTAL_SPEND, which probably denotes one-time fraud. Otherwise, even mid levels of NO_CLAIMS are associated with high probability at any level of TOTAL_SPEND; it seems that FRAUD is not necessarily linked to TOTAL_SPEND alone. As for NO_CLAIMS and MEMBER_DURATION, fraud happens at low durations, after which the fraudsters leave. For the pair OPTOM_PRESC and MEMBER_DURATION, for which we have contrasting pair PDPs with corr = -0.10, the interpretation is very different: while Grad_b shows flat probabilities throughout, except in the empty NE corner, logistic shows a more extreme NE corner, plus probabilities declining from the NW top.
• 84. For 2-dimensional visualization, collapse the 3-d chart by averaging the levels of variable 2 into those of variable 1, and compare to the original PDP. Original and collapsed PDPs are both derived from posterior model probabilities.
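The collapsing step described above is a simple row-wise average over the levels of variable 2. A minimal sketch (the grid values are a hypothetical pair PDP, not from the presentation):

```python
def collapse_pdp(grid2d):
    """Collapse a two-variable PDP grid to one variable by averaging over
    the levels of variable 2 (the columns), as described on slide 84."""
    return [sum(row) / len(row) for row in grid2d]

# hypothetical 2x2 pair-PDP grid: rows = levels of variable 1
grid = [[0.25, 0.75],
        [0.50, 1.00]]
print(collapse_pdp(grid))  # [0.5, 0.75]
```

Plotting the collapsed curve against the ordinary one-variable PDP is exactly the overlap comparison used on the following slides: large gaps flag an interaction with variable 2.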
  • 85. No room for TOTAL_SPEND
  • 86.
  • 87.
• 88. Comments for BMI. In the case of NO_CLAIMS, all pairs show overlap of the collapsed and original PDPs, except with OPTOM_PRESC. In the GB case, DURATION brings the probability down slightly, because duration is itself a strong predictor. LG shows that the presence of OPTOM_PRESC raises the posterior probability, an effect less accentuated in GB. The LG model could benefit from a NO_CLAIMS × OPTOM_PRESC interaction, or possibly from an overall transformation obtaining information per month and per number of members. (LG chart with TOTAL_SPEND omitted for brevity.) MEMBER_DURATION shows overlap with all second variables, plus a declining slope, more evident in the LG models.
• 89. Comments for BMI (cont.). It is possible to obtain 3-way and higher PDPs, and also to collapse them; not tried here. Given the overlap between the original and collapsed PDPs, the UMI effects are correct so far, except possibly for the triplet NO_CLAIMS, MEMBER_DURATION and OPTOM_PRESC.
  • 90.
  • 91.
• 92. P-comp 1 is mostly fitted by DOCTOR_VISITS and MEMBER_DURATION; # 2 (which fits the residuals from step 1) by NO_CLAIMS and OPTOM_PRESC, etc.
  • 94. For Illustration: No_claims for fraud = 1 still highly correlated with second eigenvector.
• 96. Overall view, omitting # 2 and # 3 for brevity's sake.
• 97. Comments for PCA results. PC # 1 groups MEMBER_DURATION and DOCTOR_VISITS together, NO_CLAIMS and TOTAL_SPEND in another group, and the remaining variables separately (the separation can be proven by statistical inference). For the logistic case, note that NO_CLAIMS is the first variable entered in the Stepwise selection (earlier slides), followed by MEMBER_DURATION and OPTOM_PRESC; yet NO_CLAIMS does not have the largest correlation with the first component, even when looking at correlations by values of FRAUD. GB likewise leads with NO_CLAIMS and MEMBER_DURATION, a hierarchy not represented in the Principal Components Analysis. PCA therefore does not provide a framework for interpreting the models.
• 98. Comments for PCA results (cont.). PCA orthogonalizes away predictor effects when going from step to step, something our present modeling methods do not do. Having chosen a cutoff point on the posterior probability, it is possible to obtain analogous PCA results for predicted 0 and predicted 1, obtain the correlations, and compare them to the previous results. Statistical inference can then determine equality/inequality of the correlations (with the original results) for different cutoff points.
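The correlation of each predictor with the first component, used in the comments above, can be computed directly. A pure-Python sketch using power iteration on the correlation matrix (the two toy columns are hypothetical, not the presentation's data):

```python
def standardize(col):
    n = len(col)
    m = sum(col) / n
    s = (sum((v - m) ** 2 for v in col) / n) ** 0.5
    return [(v - m) / s for v in col]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def pc1_scores(cols, iters=500):
    """First-principal-component scores via power iteration on the corr matrix."""
    z = [standardize(c) for c in cols]
    p, n = len(z), len(z[0])
    R = [[pearson(z[i], z[j]) for j in range(p)] for i in range(p)]
    w = [1.0] * p
    for _ in range(iters):
        w = [sum(R[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return [sum(w[i] * z[i][k] for i in range(p)) for k in range(n)]

doctor_visits   = [1, 3, 5, 7, 9]          # toy values
member_duration = [50, 140, 150, 260, 300]  # toy values
scores = pc1_scores([doctor_visits, member_duration])
print(pearson(doctor_visits, scores))  # close to 1 when this pair dominates PC1
```

Running the same correlations separately for FRAUD = 0 and FRAUD = 1 rows gives the by-value comparison mentioned above.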
  • 99.
  • 100.
  • 101.
• 102. Comments for Statistical Inference – Multiple Comparisons. Bars below '0.05' are considered significant, having taken multiple-comparison effects into consideration. Most results are insignificant, but the significant ones differ across models: NO_CLAIMS provides the same information throughout the range of probabilities for GB, but LG finds the first bin to be significantly different from the rest. The GB results stem from GB's split-based search. Thus LG can state that a low probability level indicates NO_FRAUD; GB cannot.
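The '0.05' bars imply a multiple-comparison correction; the deck does not state which one was used, but a standard choice such as Benjamini–Hochberg can be sketched:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected under Benjamini-Hochberg FDR control."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # largest rank whose sorted p-value sits under the BH line rank/m * alpha
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])

print(benjamini_hochberg([0.001, 0.2, 0.01, 0.03]))  # [0, 2, 3]
```

A Bonferroni correction (compare each p-value to alpha/m) would be the more conservative alternative; either way, the per-bin conclusions above depend on which correction is applied.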
  • 103.
• 104. A definite monotonic relationship: higher values of NO_CLAIMS (e.g., bin 5) are associated with higher probability. Note the differences between LG and GB.
• 106. GB monotonic, LG slightly U-shaped.
• 107. Similar coefficients among logistic, GB and beta regressions. Via beta regression, the standard interpretation of log odds is possible, with caveats. Vars × Models × Coeffs (Coeff / Importance). Columns: (1) M2_TRN_GRAD_BOOSTING, (2) M2_TRN_GRAD_BOOSTING_BETA_REG, (3) M2_TRN_LOGISTIC_STEPWISE, (4) M2_TRN_LOGISTIC_STEPWISE_BETA_REG, (5) M2_VAL_GRAD_BOOSTING, (6) M2_VAL_LOGISTIC_STEPWISE:

Variable          (1)      (2)       (3)      (4)             (5)      (6)
MEMBER_DURATION   0.7318   -0.0051   -0.0057  -0.0057         0.7318   -0.0084
DOCTOR_VISITS     0.3925   -0.0061   .        .               0.3925   .
TOTAL_SPEND       0.6610   -0.0000   -0.0000  -0.0000         0.6610   -0.0000
OPTOM_PRESC       0.5944   0.1713    0.2132   0.2132          0.5944   0.1634
NO_CLAIMS         1.0000   0.7027    0.7921   0.7921          1.0000   0.7351
SCALE             .        21.8895   .        2.590291516E16  .        .
INTERCEPT         .        -0.7979   -0.8352  -0.8352         .        -0.3111
• 108. Beta Regression results. Analyzing the posterior probabilities (i.e., the original GB and LG posteriors) via beta regression yields very similar coefficients and structures → beta regression is reassuring but provides no additional information.
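The "standard interpretation of log odds" mentioned on slide 107 is that, under a logit link, each unit increase in a predictor multiplies the odds by exp(coefficient). A minimal illustration using the OPTOM_PRESC coefficient from that table (the interpretation carries the slide's caveats):

```python
import math

coef_optom = 0.2132  # OPTOM_PRESC coefficient, logistic / beta regression (slide 107)
odds_ratio = math.exp(coef_optom)
print(round(odds_ratio, 3))  # 1.238: each extra prescription multiplies the odds by ~1.24

def inv_logit(eta):
    """The same logit link maps the beta-regression linear predictor to its mean."""
    return 1.0 / (1.0 + math.exp(-eta))
```

For the beta regression this describes the modeled mean of the posterior probability rather than an event probability, which is one of the caveats.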
  • 109.
• 110. Possible to say that … 1) Manipulating just NO_CLAIMS solves the problem? 2) Maybe add MEMBER_DURATION and OPTOM_PRESC for parts of the NO_CLAIMS range? 3) Maybe add if-then rules from the simplified TREE_REPRESENTATION, because it is easier than GB and more interactive than LG? 4) If using a Neural Network, and NN derivatives, abandon all hope of interpretation? 5) → Interpretation needs a definition of the INTENDED AUDIENCE (see Tolstoy ut supra).
  • 111. Possible to say that … (cont. 1) 1) The analyst needs to focus on NO_CLAIMS, MEMBER_DURATION and OPTOM_PRESC as an ‘IMPORTANT’ group. 2) Different model Interpretations should be entertained. 3) Different marginal effects must be explained.
• 112. Final thoughts, before I exhaust the audience (if not exhausted already). MI analysis can proceed further, obtaining insights from collapsing three-way PDPs, for instance. If an 'easier' linear-model explanation is preferred, beta regression on the posterior probability would provide regression-like information. Still, beta regression is not straightforward, and model selection is a big issue.
  • 113.
• 114. Ch. 1.1-114, 2019-05-10. Future steps, focused on MMI: 1) Collapsing higher PDP orders, i.e., 3-way variables, and interpreting them. 2) Beta regression for a 'linear' interpretation; more difficult because it requires a model search as well, plus the additional error in modeling the posterior probability of the original model. 3) Andrews' curves.
  • 115.
• 116. LIME: Local Interpretable Model-agnostic Explanations: fits a surrogate interpretable model on the black-box model, applied to observations of interest. The tree representation in this presentation is similar in spirit. (https://homes.cs.washington.edu/~marcotcr/blog/lime/) ICE: clusters or a classification variable applied to PDP results. For a given predictor, ICE plots draw one line per observation, representing how that instance's prediction changes when the predictor changes. SHAP values: SHapley Additive exPlanations (Lundberg et al., 2017): measures the positive or negative contribution of each feature to the posterior probability, a technique used in game theory to determine each player's contribution to the success of a game. Affected by correlations among predictors → focusing on just one predictor to change behavior may change other predictors as well (available in Python). AND OTHERS ….
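The ICE description above ("one line per observation") is a small variation on the PDP computation. A sketch with a hypothetical stand-in model (coefficients invented for illustration):

```python
import math

def predict(row):
    # hypothetical stand-in for a fitted black-box model
    eta = -1.0 + 0.7 * row["no_claims"] - 0.004 * row["member_duration"]
    return 1.0 / (1.0 + math.exp(-eta))

def ice_curves(f, data, var, grid):
    """One curve per observation: prediction as var sweeps the grid,
    all other predictors kept at that observation's own values."""
    return [[f({**row, var: g}) for g in grid] for row in data]

data = [{"no_claims": 0, "member_duration": 100},
        {"no_claims": 4, "member_duration": 280}]
curves = ice_curves(predict, data, "no_claims", [0, 2, 4])
# averaging the ICE curves pointwise recovers the ordinary PDP
pdp_curve = [sum(c[j] for c in curves) / len(curves) for j in range(3)]
```

Heterogeneous (crossing or fanning) ICE lines reveal interactions that the averaged PDP hides, which is the usual motivation for the method.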
• 117. References
Lundberg S.M., Lee S.-I. (2017), "Consistent feature attribution for tree ensembles", presented at the 2017 ICML Workshop on Human Interpretability in Machine Learning (WHI 2017), Sydney, NSW, Australia. https://arxiv.org/abs/1706.06060
Molnar C. (2018), Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/
Tolstoy, Leo (1894), The Kingdom of God Is Within You.