This document discusses binary classification and logistic regression. It defines key terms like odds, odds ratios, and probability. It also notes some issues that can arise, such as imbalanced samples, separation, and unclear practices regarding variable selection and dealing with collinearity. Interpretation of coefficients in logistic regression is discussed, specifically that coefficients estimate the log of the odds ratios comparing different levels of the independent variables.
2. 2019-06-20 Leonardo Auslender Copyright 2009
Contents
Present practice in classification methods and interpretation
of probability.
Logistic Regression.
Odds.
Coefficient interpretation and p-values.
Model Performance and Assessment.
Gains Charts, ROC.
Variable Selection.
Issues:
Balanced / unbalanced samples.
Separation.
Canonical Discriminant Analysis
3. 2019-06-20
Present Practice, issues and headaches.
Classification is the data mining area par excellence. We will focus on
binary targets of events/non-events. Research & applications in:
Clinical data analysis (disease / no-disease, with its insistence
on odds ratios and logistic regression),
Direct marketing: response/no-response, attrition, etc.
Recommender systems: interesting / uninteresting.
Fraud, terrorism, banking, etc.
Issues and Headaches (can’t cover them all in this lecture):
- Mixture of events/non-events in the training/estimation sample for a binary target.
- Obfuscating terminology.
- Model comparisons.
- Modeling methodologies confront unexpected issues: collinearity and separation in
logistic regression (and neural networks?), smooth vs. step response functions in trees, etc.
4. 2019-06-20
Present Practice, issues and headaches.
Unclear practices on the mixture of 0/1 in the target variable for estimation, and
on the relative costs of misclassifying 0s and 1s.
Obfuscating terminology around ROC, precision, model choice, etc.
Concepts used in classification methods to compare models are derived
from different methodologies (i.e., trees, neural networks, etc.); a mixture of
methods is used in practice.
Unclear practices on separation, collinearity and variable selection. Collinearity
is likely more of a bête noire than in the linear regression case, and adds
doubts about the stability of predicted probabilities when scoring future
databases.
All models produce predictions in probability or rank-probability form;
a decision has to be made about whether to use a cutoff point.
In the next pages: Events = “1”, non-Events = “0”
(usually 1, -1 in engineering and science applications).
5. 2019-06-20
Meaning of Probability Statements: context-dependent.
Probability is measured in the interval (0, 1), and methods are
“mechanical”, i.e., context is provided by the analyst. E.g.:
1. A model estimates that a household has a 70% probability of responding to a credit card
solicitation. The solicitation cost is minimal, and any bad feeling if a non-responding
customer didn’t want to be solicited is disregarded. Likely action: solicit.
2. A model estimates that the probability that the conference ceiling will fall on us right now is
40%. How many of you will stay until I finish reading this paragraph? Action: run for your
life? Sue the presenter?
3. DNA matching asserts that Prob(male A is father of baby) = 95%, i.e., 1 in 20 is a false
positive. Action: A is the father?
4. The probability of a devastating earthquake in NYC in the next 24 hours is negligible but still
non-zero. Nobody seems to care about this element of probability. Why? Would the same
apply in LA, with its higher probability?
The cost (profit) of implementing/not implementing a decision, even if not exactly
quantifiable, is most important in context. CONTEXT, CONTEXT, CONTEXT.
8. 2019-06-20
Binary Dependent Variable – Some Definitions.
1) Standard linear model Y* = Xβ + ε, but now Y is not continuous:
Y = 1 if Y* > k (Y* continuous), 0 otherwise; e.g., 'good' student if GPA > 3.
Usual assumptions hold and k is usually unknown. Ergo, estimate Y*?
Assume k = 0 below for ease of exposition.
Let π = Pr(Y = 1 | X = x) = Pr(Xβ + ε > 0) = Pr(ε > -Xβ) =
= 1 - F(-Xβ), where F is the CDF of ε.
Since E(ε) = 0, E(Y*) = Xβ, expanded in terms of Y as:
E(Y) = Σ πi yi = π(1) + (1 - π)(0) = π = Xβ.
Problem: if Y is 0/1, ε can take only two values, and therefore is
not normal (do we care?).
9. 2019-06-20
Binary Dependent Variable – Some Definitions to drop linear model.
If Yi = 1, εi = 1 - E(Yi) = 1 - Xiβ = 1 - πi, which occurs with probability πi;
if Yi = 0, εi = -πi, which occurs with probability (1 - πi): the error is not normally but
binomially distributed, and its variance is not constant (if π = .5, maximum
variance = .25; if π = .01 (in the tails), variance = .0099; as π → 1, variance
→ 0) ⇒ any predictor affecting the mean also affects the variance ⇒ the usual linear
model is not applicable because it assumes constant variance.
Obviously, π estimated from Y* above may lie outside [0, 1] ⇒ estimated variance < 0. One
could still rescale predicted values to lie within 0-1 (Foster & Stine, 2004), but
how to compare two models that predict different values > 1? (Rescaling to
(0, 1) is quite ad hoc.)
A more serious reason to drop the linear model:
Linear marginal effects: in the linear model, the effect of ΔX on prob(Y) is constant regardless
of the initial value of prob(Y). E.g., ΔX = 1 ⇒ Δprob(Y) = .1 whether prob(Y) =
.5 or prob(Y) = .93. Intuitively, Δprob(Y) should be larger for prob(Y) closer
to 0.5.
10. 2019-06-20
Contrived example to motivate marginal effects.
Suppose an urn with 50 red balls and 50 blue balls. The exercise is to determine
the percentage of red balls in the urn, and the ‘effort’ required to increase the
probability of obtaining a red ball.
“Effort” (i.e., the X variable) is the number of additional red balls to add to increase
the probability. Starting from the 50/50 mixture, X = 0 (i.e., not adding any red ball yet)
just gives 50%. To reach 51%, 52%, …, we need to solve for r in the following equation
(b remains at 50; r = # reds, b = # blues).
Let’s see how to raise the probability from .5 to .95 in jumps of .05 by raising r:
prob = r / (r + b)  ⇒  r = b · prob / (1 - prob)

Obs    b    prob        r
 1    50    0.50     50.000
 2    50    0.55     61.111
 3    50    0.60     75.000
 4    50    0.65     92.857
 5    50    0.70    116.667
 6    50    0.75    150.000
 7    50    0.80    200.000
 8    50    0.85    283.333
 9    50    0.90    450.000
10    50    0.95    950.000

Δprob / ΔX is not constant, as in linear regression: it takes an increasing
number of red balls to increase the probability.
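The urn table above can be reproduced in a few lines; a small Python sketch (the function name `reds_needed` is ours, not from the slides):

```python
# Number of red balls needed so that the probability of drawing red reaches p,
# with b = 50 blue balls fixed: prob = r / (r + b)  =>  r = b * p / (1 - p).

def reds_needed(p, b=50):
    """Red balls required for P(red) = p with b blue balls in the urn."""
    return b * p / (1 - p)

for p in [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]:
    print(f"p = {p:.2f}   r = {reds_needed(p):8.3f}")
```

The printout matches the table: the required "effort" r grows without bound as p approaches 1.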
11. 2019-06-20
Prob rises from .75 to .80 as reds rise from 150 to 200, but going from .90 to .95
requires raising the red balls from 450 to 950.
12. 2019-06-20
Binary Dependent Variable – Some Definitions.
For ordinal or nominal dependent variables, coding is arbitrary, and any
monotonic transformation of π gives different results.
We need a non-decreasing function mapping Xβ into the unit interval. The function
usually chosen is the CDF of the unit-normal distribution ⇒ the PROBIT method, or
the standardized logistic distribution ⇒ the linear logistic or logit model (more
convenient, no integrals; similar to a t-distribution with 7 dfs). Non-symmetric
alternatives: log-log and complementary log-log (cloglog, used in survival
and interval-censored models).
NB: Purpose is to model probability of occurrence of event.
13. 2019-06-20
Different link functions, with slope 3 (plotted in the next slide):
Logistic: P1 = 1 / (1 + exp(-X·3))
Probit: P1 = Φ(X·3·√3/π) (normal CDF, rescaled to be comparable to the logistic)
Linear: P1 = .5 + X/3
Complementary log-log (extreme value, Gompit; note its skewness):
P1 = 1 - exp(-exp(X·3))
Not shown: the log-log link, -log(-log(π)) (seldom used because of
inappropriate behavior for π < .5).
For which link is "best", see Koenker and Yoon
(2009, Journal of Econometrics).
15. 2019-06-20
Binary Dependent Variable, odds and log odds.
The logistic is mathematically easier than the probit.

π = e^(Xβ) / (1 + e^(Xβ)) = [1 + e^(-Xβ)]^(-1), the logistic CDF.

Interpretable as log-odds:
Odds = π / (1 - π) = e^(Xβ)
log(π / (1 - π)) = Xβ, the log-odds or logit.

Note that the logit (the ‘linear equation’ Xβ) estimates log-odds, not
probability. Log-odds are popular in gambling: e.g., odds of 19 to 1 that the ‘house’
will win (equivalent to a 95% probability of winning for the house).
π is a nonlinear function of β and requires an iterative method of estimation.
NB: Will skip individual parameter inference.
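The probability/odds/log-odds correspondences above are easy to check numerically; a quick Python sketch (function names are ours):

```python
import math

def odds(p):            # odds = p / (1 - p)
    return p / (1.0 - p)

def prob_from_odds(o):  # inverse: p = o / (1 + o)
    return o / (1.0 + o)

def logit(p):           # log-odds
    return math.log(odds(p))

def logistic(z):        # inverse of the logit: the logistic CDF
    return 1.0 / (1.0 + math.exp(-z))

# Gambling example from the slide: house odds of 19 to 1 <=> 95% probability.
print(odds(0.95))            # ~19
print(prob_from_odds(19.0))  # 0.95
print(logistic(logit(0.3)))  # round-trips back to 0.3
```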
17. 2019-06-20
Example of gambling odds. What is the meaning of 1 / p?
Especially in the case of rare events, that is, when the probability of an event is
low, the reciprocal of the probability, 1 / p, provides a ‘1 in n’ re-scaling that
can be informative. For instance, if prob(event) = 0.018, its reciprocal is
about 1 in 56 tries. If the probability distribution is geometric, then when p
= 0.018 you would need, on average, about 56 flips of a coin to obtain the desired event.
We can relate this topic to odds, which will be useful later on. If the odds of winning
a bet are 1.25, and you bet 8, then you get 8 · 1.25 = 10 back when you win
(including the original bet, for a total profit of 2), and nothing in the case of
a loss. The bet would be fair if the probability of winning were 8 / 10 = 0.8,
the reciprocal of which is 1.25.
In the context of survey sampling, the reciprocal of the probability of being
included is called the sampling weight.
24. 2019-06-20
Coefficient interpretation: Odds, Odds-ratios (non-interactive model).
Estimated model: log(odds) = a + b·X1 + c·X2  ⇒  Odds = e^(a + b·X1 + c·X2).
e^a = odds(Y) when X1 = 0 and X2 = 0.

Odds(Y | X1 = x1) = e^(a + b·x1 + c·X2)
Odds(Y | X1 = x1 + 1) = e^(a + b·(x1 + 1) + c·X2)

Odds Ratio(X1) = e^(a + b·(x1 + 1) + c·X2) / e^(a + b·x1 + c·X2) = e^b,
for X2 at any value.

E.g., if b = .04, e^b = 1.0408 ⇒ odds(Y) increase 4.08% for ΔX1 = 1.
In general, for ΔX1 = 1, odds(Y) change by (e^b - 1)·100%.
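The fact that the odds ratio equals e^b regardless of X2 (in the non-interactive model) can be verified numerically; a sketch with made-up coefficients:

```python
import math

# Hypothetical coefficients for log(odds) = a + b*X1 + c*X2 (no interaction).
a, b, c = -1.0, 0.04, 0.7

def odds(x1, x2):
    return math.exp(a + b * x1 + c * x2)

# The odds ratio for a one-unit increase in X1 equals e^b, whatever X2 is:
for x2 in (0.0, 3.0, -10.0):
    or_x1 = odds(5.0 + 1.0, x2) / odds(5.0, x2)
    print(round(or_x1, 6), round(math.exp(b), 6))  # ~1.0408: +4.08% odds
```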
25. 2019-06-20
Coeff. interpretation - dummy var – non-interaction model.
Y ∈ {0,1}, X ∈ {0,1};  logit(Y) = α + βX + ε.

Pr(Y=1 | X=1) = e^(α+β) / (1 + e^(α+β))  …  finding the odds:

Pr(Y=1 | X=1) / Pr(Y=0 | X=1) = Pr(Y=1 | X=1) / (1 - Pr(Y=1 | X=1)) = e^(α+β), the odds at X = 1.
Pr(Y=1 | X=0) / Pr(Y=0 | X=0) = e^α, the odds at X = 0.

Odds ratio: the ratio of odds = e^(α+β) / e^α = e^β:
β is the change in log-odds due to a one-unit change of X.

More generally, if X ∈ {r, s}: the odds at X = r are e^(α+rβ) (similarly for X = s),
and the odds ratio for an increase of X from r to s is {e^β}^(s-r).
26. 2019-06-20
Coefficient interpretation: Odds-ratios (interactive model).
Estimated model: log(odds) = a + b·X1 + c·X2 + d·X1·X2
⇒  Odds = e^(a + b·X1 + c·X2 + d·X1·X2).

Odds(Y | X1 = x1) = e^(a + b·x1 + c·X2 + d·x1·X2)
Odds(Y | X1 = x1 + 1) = e^(a + b·(x1 + 1) + c·X2 + d·(x1 + 1)·X2)

Odds Ratio(X1) = e^(a + b·(x1 + 1) + c·X2 + d·(x1 + 1)·X2) / e^(a + b·x1 + c·X2 + d·x1·X2)
= e^(b + d·X2)

⇒ e^b is NOT the odds ratio for X1 except when X2 = 0;
the odds ratio is conditional on the values of X2.
NB: If X2 is measured in deviation terms, say from its mean, e^b is the
odds ratio at X2 = mean.
27. 2019-06-20
Coefficient interpretation: Odds-ratios (interactive model).
Estimated model: log(odds) = a + b·X1 + c·X2 + d·X1·X2.

Odds Ratio(X1) = e^(a + b·(x1+1) + c·X2 + d·(x1+1)·X2) / e^(a + b·x1 + c·X2 + d·x1·X2) = e^(b + d·X2)
Odds Ratio(X2) = e^(a + b·X1 + c·(x2+1) + d·X1·(x2+1)) / e^(a + b·X1 + c·x2 + d·X1·x2) = e^(c + d·X1)
Odds Ratio(X1, X2) (both raised by 1) = e^(b + c + d·(X1 + X2 + 1))

⇒ e^d = OR(X1, X2) / [OR(X1) · OR(X2)]
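That the X1 odds ratio is conditional on X2 under interaction can again be checked numerically; a sketch with made-up coefficients:

```python
import math

# Hypothetical coefficients for log(odds) = a + b*X1 + c*X2 + d*X1*X2.
a, b, c, d = -1.0, 0.4, 0.2, 0.1

def odds(x1, x2):
    return math.exp(a + b * x1 + c * x2 + d * x1 * x2)

def or_x1(x2, x1=2.0):
    """Odds ratio for X1 -> X1 + 1, holding X2 fixed."""
    return odds(x1 + 1.0, x2) / odds(x1, x2)

# With interaction, the X1 odds ratio is e^(b + d*X2): it depends on X2.
print(or_x1(0.0), math.exp(b))          # equal only when X2 = 0
print(or_x1(5.0), math.exp(b + d * 5))  # different from e^b
```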
30. 2019-06-20
Interpretation.
Intercept: exp(-0.3390) = 0.71 is the baseline odds: the ratio of the probability
of fraud versus non-fraud when all predictors are zero, i.e., no
doctor visits, no prescriptions, zero member duration, etc.
Member duration: the odds-ratio point estimate of 0.993 is less than 1,
which means that the ratio of the probability of fraud versus non-fraud
decreases the longer the member duration. The odds ratio is constant as
long as there is no interaction term.
Number of claims: the odds ratio of 2.146 means that the ratio of the
probability of fraud versus non-fraud increases with the number of
claims, as long as there is no interaction term.
Etc.
35. 2019-06-20
Binary Dependent Variable – Log-likelihood. ***
Since observations are assumed independent, the joint probability is (take
logs because they are easier to work with):

Pr(Y1, …, Yn) = Π πi^Yi (1 - πi)^(1-Yi)

Taking logs (omitting the "i" subscript), and remembering that π = f(Data, β):

ln L(β, data) = Σ [ Y ln π + (1 - Y) ln(1 - π) ]
             = Σ [ Y ln(π / (1 - π)) + ln(1 - π) ]
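The two forms of the log-likelihood above are algebraically identical, which a short Python sketch (toy data, hypothetical fitted probabilities) can confirm:

```python
import math

def log_likelihood(y, pi):
    """Sum of y*ln(pi) + (1-y)*ln(1-pi) over observations."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
               for yi, p in zip(y, pi))

# Toy data with hypothetical fitted probabilities:
y  = [1, 0, 1, 1, 0]
pi = [0.8, 0.3, 0.6, 0.9, 0.2]

# Equivalent logit form: sum of y*ln(pi/(1-pi)) + ln(1-pi).
alt = sum(yi * math.log(p / (1 - p)) + math.log(1 - p) for yi, p in zip(y, pi))

print(log_likelihood(y, pi), alt)  # identical values
```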
36. 2019-06-20
Binary Dependent Variable – Model evaluation via inference.
Nested approach:
Does the model with added predictor(s) provide “significantly” more information
about the dependent variable than the model without them? Typically H0:
constrained model, H1: fuller model (NB: no notion of model fit as in the R2 of regression).
Compare two nested models:
Log(odds) = α + β1x1 + β2x2 + β3x3 + β4x4 (model 1) H1
Log(odds) = α + β1x1 + β2x2 (model 2) H0
Three tests:
– Likelihood ratio test (LRT): balanced between H0 and H1.
– Wald test: starts at H1 and asks whether the restriction toward H0 worsens the fit.
– Score or Lagrange multiplier (LM) test: starts at H0 and asks whether
movement towards H1 is an improvement. The 1st derivative of the likelihood
function is called the score function.
37. 2019-06-20
Model χ2 and Model Comparisons.
Likelihood-ratio test: used to contrast two models, one of which is a subset of the other
(i.e., some βj’s set to zero; called nested-model inference).
G0 = 2 (log L1 – log L0) ~ χ2 (k), k = number of coefficients set to 0 (called the
deviance, and sometimes residual deviance).
G0 is called -2 LLR or 2 LLR (depending on the order of the likelihoods).
Bit of confusion: -2 LLR is also a deviance, but specific to the model vs. the saturated
model, and the saturated model has log-likelihood 0.
Is likelihood the same as probability?
In the case of probability, we know the pdf (probability distribution function) and the
parameter values, and want to know the probability of specific event(s). I.e., we know the betas.
In the case of likelihood, given specific event(s), we estimate the pdf and/or the model
parameters. That’s why, given data, we want to estimate the parameters that maximize the
probability of the event(s) reflected by the data, for an assumed pdf. We don’t know but
estimate the betas.
38. 2019-06-20
Score (LM) Test.
Let U(β) be the gradient (score) vector d lnL / dβ.
Let H(β) be the Hessian matrix d² lnL / dβ².
Let I(β) = -H(β), or the expected value of -H(β) (the information matrix).
Let β0 be the MLE under H0.
Score statistic: U'(β0) I(β0)⁻¹ U(β0) ~ χ2(r), r = number of restrictions.
39. 2019-06-20
Example: Y ~ B(n, θ) ***
H0: θ = 0.5, H1: θ = 0.2. Wald is based on the distance (0.5 – 0.2), the LRT on the
vertical distance between log-likelihoods, the LM on the slope of the log-likelihood
at θ = 0.2. When the log-likelihood is a smooth curve, all tests yield similar answers.
[Figure: log-likelihood plotted against θ in (0, 1); the Wald statistic is the horizontal
distance between 0.2 and 0.5, half the deviance is the vertical distance, and the
score test is the slope at θ = 0.2.]
40. 2019-06-20
Non-nested Model Comparisons:
Akaike Information Criterion (AIC) = deviance (= -2 LLR) + 2k,
k = # of model parameters.
BIC (Bayesian Information Criterion), or SC = deviance + k log n.
In both cases, the smaller, the better the model.
NOTE: in non-parametric models, such as tree-based models, ‘k’
is not fully defined ⇒ no fully accepted AIC or BIC.
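A quick sketch of the two criteria, with made-up deviances for two hypothetical competing models; note how BIC penalizes the larger model more heavily:

```python
import math

def aic(deviance, k):
    """AIC = deviance (-2 log L) + 2k, k = number of parameters."""
    return deviance + 2 * k

def bic(deviance, k, n):
    """BIC/SC = deviance + k * log(n)."""
    return deviance + k * math.log(n)

# Hypothetical competing (non-nested) models on n = 1000 observations:
m1 = {"deviance": 820.0, "k": 5}
m2 = {"deviance": 806.0, "k": 9}
for m in (m1, m2):
    print(aic(m["deviance"], m["k"]), bic(m["deviance"], m["k"], 1000))
# Smaller is better: here AIC prefers m2, while BIC's heavier
# penalty (log 1000 > 2) flips the choice to the smaller m1.
```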
41. 2019-06-20
Somers’ D and Pairs Concordance.
P = # of possible pairs of observations with different values of Y = n0·n1.
% Concordant: proportion of pairs in which the predicted Prob(Y = 1) is higher for
the “1” observation than for the “0” observation: nc / P.
% Discordant: proportion of pairs in which the predicted Prob(Y = 1) is lower for
the “1” observation than for the “0” observation: nd / P.
% Tied: proportion of pairs neither concordant nor discordant: nt / P.
Somers’ D = (nc – nd) / (nc + nd + nt); ranges from -1 (all pairs discordant)
to 1 (all pairs concordant).
Gamma = (nc – nd) / (nc + nd)
Tau-a = (nc – nd) / P
c = .5 (1 + Somers’ D), more popularly known as the ROC area (reviewed below).
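The pair-counting definitions above translate directly into code; a brute-force Python sketch over a toy set of predicted probabilities:

```python
def association(y, p):
    """Concordant/discordant/tied counts over all (event, non-event) pairs."""
    events    = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    nc = nd = nt = 0
    for p1 in events:
        for p0 in nonevents:
            if   p1 > p0: nc += 1
            elif p1 < p0: nd += 1
            else:         nt += 1
    pairs = len(events) * len(nonevents)   # P = n1 * n0
    somers_d = (nc - nd) / (nc + nd + nt)
    c = 0.5 * (1 + somers_d)               # = area under the ROC curve
    gamma = (nc - nd) / (nc + nd)
    return nc, nd, nt, somers_d, gamma, c

y = [1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.4]
print(association(y, p))  # nc=10, nd=1, nt=1, D=.75, c=.875
```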
42. 2019-06-20
Hosmer-Lemeshow fit (2000, p. 148).
Posterior probabilities in logistic regression are calculated from
covariate patterns. In data mining, most patterns have very few
observations, and many potential patterns are empty.
Assume 5 binary predictors in the model ⇒ maximum number
of patterns = 2 ^ 5 = 32. Assume only J = 8 patterns exist in the
data ⇒ expected values will be small or 0 in most cells. In data
mining, typically n = J. With continuous predictors, the number of
patterns is much larger.
43. 2019-06-20
Hosmer-Lemeshow “decile of risk” fit (2000, p. 148).
Proposal: “create patterns” by grouping by percentiles of the posterior
predicted distribution. By simulation, if g = 10 percentiles (deciles) are
chosen, the statistic is distributed as chi-square (g – 2) when J = n. Use
Pearson’s χ2 test; small p-values indicate lack of fit.
Problem: the Hosmer-Lemeshow test is known to fail with
continuous covariates due to the large number of potential
patterns.
Besides, it is known to produce misleading results due to ties or the order of
observations (Bertolini et al., 2000). The test is considered obsolete (Hosmer et al.,
1997).
DO NOT USE IT.
44. 2019-06-20
Cox and Snell (1989): R² = 1 - [L(0) / L(β̂)]^(2/N), L(0): likelihood of the
intercept-only model.
Nagelkerke (1991), max-rescaled R²: R² / Max R², where Max R² = 1 - [L(0)]^(2/N).
(Obtained via the RSQUARE option in SAS PROC LOGISTIC,
called RSquare and Max-rescaled RSquare respectively.)
Brier score: (1/n) Σ (P̂i - Yi)².
What would statisticians do without R2? Create more …
46. 2019-06-20
(1) Significant LRT ⇒ at least one beta is significantly different
from 0. Analogous to the F-test in linear regression.
Agreement with (2) and (3); if there is no agreement, the model is in
doubt.
(2) (4) and (5) confirm that the intercept-only model is inferior.
(3) In (6), the difference between the intercept-only model and the
intercept-and-covariates model yields (1).
47. 2019-06-20
Association of Predicted Probabilities and Observed Responses
Percent Concordant 75.4 Somers' D 0.512
Percent Discordant 24.2 Gamma 0.515
Percent Tied 0.4 Tau-a 0.160
Pairs 870504 c 0.756
Partition for the Hosmer and Lemeshow Test
FRAUD = 1 FRAUD = 0
Group Total Observed Expected Observed Expected
1 237 13 10.71 224 226.29
2 237 18 17.08 219 219.92
3 237 15 21.88 222 215.12
4 237 25 26.73 212 210.27
5 237 34 31.53 203 205.47
6 237 36 37.13 201 199.87
7 237 39 43.19 198 193.81
8 237 53 52.97 184 184.03
9 237 84 71.69 153 165.31
10 232 139 143.09 93 88.91
51. 2019-06-20
Misclassification Rates (typically assumes 0.5 cutoff)
Confusion Table.
0: “Negative” (non-event); “1”: “Positive” (event).
TN: True Negative; FN: False Negative; FP: False Positive; TP: True Positive.
“True” and “False” indicate whether the prediction matches the actual state of
nature; “Negative” and “Positive” refer to the prediction.

Actual \ Predicted      0      1
0 (non-event)          TN     FP
1 (event)              FN     TP
52. 2019-06-20
Misclassification Rates, K-S, ROC (cont. 1).
1) Classification accuracy or overall classification rate (1 –
misclassification rate), plus “0” and “1” classification rates:
Overall: (TP + TN) / (TP + TN + FP + FN).
Assumes a known and unchanging “natural” class distribution, and that the
error cost of a FP equals that of a FN. Typically favors the majority class; but
in most applications, the cost of misclassifying a “1” is higher.
“0” classification rate: TN / (TN + FP)
“1” classification rate: TP / (TP + FN)
Misleading results: assume “1” is the important class. Overall (left model) = 92.5%,
overall (right model) = 97.5%, but the right model misses all “1”s.

              Left model                Right model
           Predicted 0  Predicted 1  Predicted 0  Predicted 1
Actual 0       180          15           195           0
Actual 1         0           5             5           0
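The accuracy paradox in the two confusion matrices above can be verified in a few lines:

```python
def rates(tn, fp, fn, tp):
    """Overall accuracy plus the '0' and '1' classification rates."""
    overall = (tp + tn) / (tp + tn + fp + fn)
    rate0 = tn / (tn + fp)   # "0" classification rate (specificity)
    rate1 = tp / (tp + fn)   # "1" classification rate (sensitivity)
    return overall, rate0, rate1

# Left model:  TN=180, FP=15, FN=0, TP=5.  Right model: TN=195, FP=0, FN=5, TP=0.
print(rates(180, 15, 0, 5))   # overall 92.5%, catches all "1"s
print(rates(195, 0, 5, 0))    # overall 97.5%, but misses every "1"
```

The "better" overall accuracy on the right comes entirely from the majority class.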
53. 2019-06-20
Misclassification Rates, K-S, ROC
2) Event (“1”), non-event (“0”) and overall precisions:
PR(1) = TP / (TP + FP)  Event precision
PR(0) = TN / (TN + FN)  Non-event precision
PR = (PR(1) + PR(0)) / 2  (theoretical max for PR(i) = 1).
(Of those predicted as “1”, the proportion of “true” ones …)
3) Area under the Receiver Operating Characteristic curve
(AUROC): graph of the FP rate (x-axis) vs. the TP rate (y-axis) as the
classification threshold varies. The Laplace estimate is used (in tree algorithms)
because it yields more consistent improvements in ROC curves.
TP rate = TP / (TP + FN)  (sensitivity)
FP rate = FP / (FP + TN)  (1 – specificity)
TN / (FP + TN)  specificity
54. 2019-06-20
Recall, F1, …
Recall / sensitivity / TPR: TP / (TP + FN). Percentage of
events well classified.
F1-score / F-measure / F-score: harmonic mean of precision
and recall. Requires that recall and precision not move opposite to each
other. Useful when the costs of misclassifying 0s and
1s are very different. If costs are similar and the counts of FN ≈ FP,
accuracy can be used. The higher the F1, the better. A weighted F1-score
can also be used to give more importance to precision or to recall.

F1 = 2 · (Recall · Precision) / (Recall + Precision)

Specificity / TNR: TN / (TN + FP)
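A minimal sketch of these metrics, applied to the left model of the earlier accuracy example (TN=180, FP=15, FN=0, TP=5):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Left model from the accuracy example: overall accuracy was 92.5%,
# yet precision on the "1"s is only 25%.
print(precision_recall_f1(tp=5, fp=15, fn=0))  # (0.25, 1.0, 0.4)
```

Note how a model with perfect recall can still have a low F1 when precision is poor, which overall accuracy hides.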
55. 2019-06-20
Which measure to use?
If costs are similar, the counts of FN and FP are similar, and recall
does not move opposite precision, use F1; also when the event
prior is much smaller.
Choose recall if a FN is more important than a FP. E.g.,
misdiagnosing a cancer patient as healthy.
Choose precision when you want the predicted positives to be
truly positive. E.g., spam: prefer to miss a spam message rather
than send a good message to the spam folder.
Choose specificity to concentrate on non-events and avoid
false alarms (FP). E.g., a trial verdict.
61. 2019-06-20
Misclassification Rates, K-S, ROC
Heuristically, AUROC = the proportion of pairs of observations, one ‘0’ and one ‘1’,
in which the posterior Prob(Y = 1) of the ‘1’ exceeds that of the ‘0’. Max value is 1.
Classification methods typically maximize classification rates ⇒
a tendency to focus just on these rates, which does not necessarily
provide a good balance between events and non-events.
ROC curves can cross, e.g., for data
sets with different proportions of
0/1. Curves toward the NW corner indicate a better model.
B is preferred to C for low FP
rates, but C is preferred over B later
on. A is clearly inferior.
“Best” model: max AUROC. But the
max-AUROC model may not be ‘best’ for a
specific cost and class
distribution. Plus, it is based on
classification and not precision
measures (detailed below).
62. 2019-06-20
Misclassification Rates, K-S, ROC
ROC:
1) a step function built by varying the classification threshold;
different algorithms exist to ‘smooth’ the curves.
2) invariant to any monotonic transformation of the posterior
probabilities, because only the rankings matter.
63. 2019-06-20
AUROC comparison: Gradient Boosting is better than the logistic
for both TRN and VAL.
68. 2019-06-20
Ideally, the curve of the better model reaches toward the NE corner, and Gradient
Boosting arches NE more than the logistic; this is equivalent to building the PR-AUC
for each and comparing areas (not shown). GB better than LG. Only TRN shown.
69. 2019-06-20
Standard recommendation
For balanced data (i.e., a prior of about 50% and equal costs),
choose the most outward ROC curve as the model of choice.
Otherwise, use the precision-recall curve. Notice that non-events
mostly affect the ROC curve.
NOTE: we included Gradient Boosting (reviewed in a later
chapter) to emphasize model comparisons.
71. 2019-06-20
Explanation via example: Experian – white paper.
Lift or gains chart for a credit promotion. The table can be created by deciles (10%
intervals) or vingtiles (5% intervals) of the posterior probability of being “good”, in
descending order of probability. More often, binned by equal-sized bins of observations in
descending order of probability. The max potential lift is 1 / (event prior percentage).
When comparing different models by deciles, note that the underlying probabilities can
be different across models (e.g., logistic vs. Gradient Boosting).
Cum Good % (Bads) measures the cumulative % of all Goods (Bads) captured up to the
corresponding vingtile. % Cum Diff is the corresponding difference between goods and
bads, and its largest value is the K-S statistic.
Rule of thumb: a good model should capture at least 70% of the goods by the 5th decile /
10th vingtile.
Lift: ratio of the % of events in a decile/vingtile to the overall event rate (# responders in
the decile/vingtile / total # responders). Cum lift: the corresponding cumulative version.
K-S (Kolmogorov-Smirnov): the probability point at which [cum % captured events -
cum % captured non-events] is highest. Frequently used in finance.
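A simplified gains/K-S computation, on toy scores (the data and function name are ours; a real gains chart would also report lift per bin):

```python
def gains(y, p, bins=10):
    """Cumulative % of events and non-events captured, binning observations
    by descending predicted probability; K-S = max difference."""
    order = sorted(zip(p, y), reverse=True)
    n1 = sum(y)
    n0 = len(y) - n1
    size = len(y) // bins
    cum1 = cum0 = 0
    rows, ks = [], 0.0
    for b in range(bins):
        chunk = order[b*size:(b+1)*size] if b < bins - 1 else order[(bins-1)*size:]
        hits = sum(yi for _, yi in chunk)
        cum1 += hits
        cum0 += len(chunk) - hits
        diff = cum1 / n1 - cum0 / n0      # K-S is the max of this difference
        ks = max(ks, diff)
        rows.append((b + 1, round(cum1 / n1, 3), round(cum0 / n0, 3), round(diff, 3)))
    return rows, ks

# Toy scores: events concentrated at the top probabilities.
y = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
p = [0.97, 0.95, 0.90, 0.88, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60,
     0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]
rows, ks = gains(y, p)
for r in rows:
    print(r)
print("K-S =", round(ks, 3))
```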
76. 2019-06-20
Selecting the cutoff point (remember ‘k’ from the first pages?)
Many criteria, but BRAINS are most important:
K-S (Kolmogorov-Smirnov): the probability point at which events and non-events are most
separated cumulatively.
Cumulative lift: the probability point for a selected cum-lift value (arbitrary).
Profit/cost business decision: the probability point at which profit is maximized / cost is
minimized.
Many others.
BUT:
1) Consider the costs of the decision: a mail piece to the wrong person, or the wrong HIV
treatment?
2) Are the categories under study discrete, or were they derived from a
continuous scale, such as defining default as a 3-month-late payment? If
so, shouldn’t the decision take into consideration how late the payment was?
3) If probabilities are too close to the decision point, shouldn’t we get more
data to decide?
78. 2019-06-20
Lines at 0.3 and 0.207 are different cutoffs. Note that the proportion of true
events is higher than the probability indicated by the logistic at about the 0.3 point. ***
80. 2019-06-20
Forward selection for logistic regression (Hosmer and Lemeshow, 2000)
Assume ‘p’ available predictors. At step 0 (no variables yet entered), fit the “intercept-
only model” and denote its log-likelihood by L0. Fit ‘p’ univariate logistic regressions,
one per predictor, and obtain ‘p’ log-likelihoods, L1, …, Lp.
Calculate ‘p’ likelihood ratio tests, Gj = -2 (L0 – Lj), j = 1…p, and obtain the
corresponding p-values pj, such that Prob[χ2(ν) ≥ Gj] = pj, where ν = 1 if Xj is
continuous, and ν = k – 1 if Xj has k categories.
Choose the predictor Xj with the minimum p-value below the entry ‘alpha’ level, which
usually ranges between .15 and .2. Call this predictor Xj0.
In the next step, the logistic regression with Xj0 takes the previous role of the intercept-only
regression, and a new search is started with the p - 1 remaining candidate predictors.
The search stops when no p-value is less than the entry alpha level. While α = 5% is
embedded in many practitioners’ activities, it is known that in variable selection that
level is too stringent. It is usually recommended that the “α” level for entry be in the
range of 15% to 20%, or even higher depending on the breadth or exploratory
nature of the study.
Notice that, at least in Hosmer and Lemeshow (2000), the remaining predictors are not
orthogonalized relative to the one just selected (as in variable selection for linear
regression; orthogonalization is obtained by partialing).
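The first forward-selection step can be sketched as follows; the log-likelihood values are made up (in practice they come from fitting the actual logistic regressions), and the chi-square(1) tail probability uses the identity p = erfc(√(G/2)):

```python
import math

def chi2_1_pvalue(g):
    """P[chi-square(1) >= g], via the normal tail: p = erfc(sqrt(g/2))."""
    return math.erfc(math.sqrt(g / 2.0))

# Hypothetical log-likelihoods: intercept-only model and p univariate fits.
L0 = -350.0
L = {"x1": -349.2, "x2": -339.5, "x3": -344.0}   # made-up values

alpha_enter = 0.15
pvals = {}
for name, Lj in L.items():
    G = -2.0 * (L0 - Lj)          # LRT statistic, 1 df (continuous Xj)
    pvals[name] = chi2_1_pvalue(G)

best = min(pvals, key=pvals.get)
if pvals[best] < alpha_enter:
    print("enter", best, round(pvals[best], 8))
```

Here x2 enters first; x1's p-value exceeds the entry alpha and would never enter on its own.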
81. 2019-06-20
Stepwise Selection for logistic regression.
1. We proceed in a fashion similar to forward selection in steps 1 and 2, and assume
that we have selected X1 and X2 so far. At this moment, we perform a backward
selection step to check whether we should retain the variables selected earlier. We
accomplish this by creating a model with just X1, obtaining its log-likelihood
and by LRT comparing it to the full model log-likelihood.
2. In the general case of k ‘entered’ variables, the candidate to remove is that
which yields the highest LRT p-value when removed. The comparison is done
against the α-to-remove level that marks the threshold level of explanatory power.
If the highest p-value is higher than the α-to-remove, the variable is removed;
otherwise, it stays in the model.
3. The α-to-remove level must be higher than the α-to-enter to prevent cycles of
the same variable entering and leaving the model. Higher α levels, both to enter
and to be removed, allow for more variables to remain in the model.
4. The search continues adding and removing variables until either a) all ‘p’ candidate
variables have been entered; or b) all the variables in the model have p-values to
remove that are less than the α-to-remove level, and the variables not selected
have p-values higher than the α-to-enter level.
96. 2019-06-20
Model performances are very similar, given the data set's
simplicity. AUROC is equivalent across models and data roles.
It is possible to add non-logistic techniques to obtain a wider
notion of model performance, in the chapter on trees and
ensembles.
99. 2019-06-20
Balanced / unbalanced target – rare events.
Typical situation: a binary dependent variable with far fewer ‘1’s
than ‘0’s: fraud (less than 1%), extreme diseases, oil spills,
wars, the decision to run for office, defective products, etc. Logistic
regression in this case, for instance, underestimates the probability
of rare events. There is also a tendency to create enormous databases
to contain the ‘rares’.
Typically, the misclassification cost of ‘1’s is higher than the
misclassification cost of ‘0’s. Since classifiers typically aim
at maximizing accuracy, 98% ‘0’s already yields very good
accuracy, but it could lead to very poor ‘1’ accuracy. The
same goes for ROC.
100. 2019-06-20
Most used methods to deal with unbalanced samples, for
most classification methods:
1) under-sample the ‘0’s;
2) over-sample the ‘1’s (by re-sampling with replacement); or
3) both;
4) SMOTE, which does 3) with special over-sampling of the ‘1’s;
5) use cost functions (usually difficult to establish).
SMOTE: over-sample by creating synthetic observations near existing ‘1’s
(multiply the difference between a chosen near ‘1’ neighbor’s values and the
original’s by a random value between 0 and 1, and add it to the original value).
Bing Zhu (Sichuan University), Bart Baesens (KU Leuven) & Seppe
vanden Broucke (KU Leuven) (2017) argue that 50:50 resampling is not
necessary when only focusing on goodness-of-fit measures.
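The SMOTE interpolation step described above can be sketched in a few lines (toy minority points; real SMOTE picks among the k nearest minority neighbors rather than a fixed one):

```python
import random

def smote_point(x, neighbor, rng):
    """Synthetic '1': x + u * (neighbor - x), u ~ Uniform(0, 1)."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(7)
minority = [[1.0, 2.0], [1.2, 2.5], [0.9, 1.8]]

# Interpolate the first minority point toward a (here: arbitrarily chosen)
# near minority neighbor; each synthetic point lies on the connecting segment.
synthetic = [smote_point(minority[0], minority[1], rng) for _ in range(3)]
print(synthetic)
```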
101. 2019-06-20
Typically, correct classification of the “rares” has greater value than that
of the “usuals”.
Is estimation affected?
In the case of rare events, most applications yield smaller probabilities
than they should; for Y = 1 cases, probabilities should be larger (i.e.,
0.8 instead of 0.6). π(1 - π) is then smaller (.16 vs. .24), and the estimation
variance (its inverse) is higher at .8 than at .6 ⇒ additional ‘1’s (which push
the probability higher) cause the variance to drop further, and thus ‘1’s bring
in ‘more’ information than ‘0’s (King, Zeng 2001).
PROS: re-sampling is the prevailing methodology.
CONS: ad-hoc procedure, different opinions on the re-sampling mixture,
no analysis of the effects on coefficients, drops information.
102. 2019-06-20
Balanced/unbalanced target, comment.
Re-balancing trees (Auslender, 1998, never finished): create samples with
respective 0/1 percentages of 45/55, 46/54, …, 54/46, 55/45. One typically
observes that the upper set of levels is similar or the same for all samples (split values
and variables) in tree classification (next chapter).
The lower layer typically contains similar variables that are split, sometimes in a
different hierarchical order: variable 1 is split at level 4 in sample 45/55, and at
level 5 in sample 50/50, while variable 2 behaves reciprocally.
Conclusion: the top level is the core of the tree, and the middle level still provides strong
information. After that, the information is not reliable.
A similar approach is possible for logistic regression in the context of collinearity
(collinearity induces instability in the coefficients of linear models). But see
Owen next.
NB: balanced/unbalanced comparison examples in the next lecture.
104. 2019-06-20
Owen’s 2007 findings and recommendations for logistic regression.
Owen (2007) studies infinitely imbalanced cases, where N0 → ∞ and N1 is fixed. For a
“given” model (i.e., not a variable-search model), the contribution of the X cases
when Y = 1 depends entirely on the mean of X when Y = 1, by the relation below.
Practical implication: for non-outlier data in X | Y = 1, the mixture
of 0/1 does not particularly affect estimation, except for the intercept,
which → -∞. If X | Y = 1 is clustered, consider splitting the analysis by
clusters.
X̄1 = ∫ e^(xβ) x dF0(x) / ∫ e^(xβ) dF0(x),
F0: distribution of X given Y = 0; X̄1: mean of X given Y = 1.
105. 2019-06-20
SEPARATION.
Separation is observed in the fitting process of a logistic regression
model when the likelihood converges to a finite value while at least one
parameter estimate diverges to (plus or minus) infinity.
In days of yore, separation primarily occurred in small or sparse samples
with highly predictive covariates for a fixed model.
106. 2019-06-20
A more interesting case:
groups.google.com/group/MedStats/browse thread/thread3078fd372b83f662
Obs Y x1 x2 S
1 1 29 62 10
2 1 30 83 29
3 1 31 74 18
4 1 31 88 32
5 1 32 68 10
6 2 29 41 -11
7 2 30 44 -10
8 2 31 21 -35
9 2 32 50 -8
where S = x2 - 2·x1 + 6. Note the perfect separation of Y by S:
S > -8 exactly for Y = 1. Also, S is obviously perfectly collinear
with x1 and x2.
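The separation in this example can be checked numerically with a short Python sketch over the table above:

```python
# Data from the slide; S = x2 - 2*x1 + 6 separates Y perfectly.
rows = [  # (Y, x1, x2)
    (1, 29, 62), (1, 30, 83), (1, 31, 74), (1, 31, 88), (1, 32, 68),
    (2, 29, 41), (2, 30, 44), (2, 31, 21), (2, 32, 50),
]

def s(x1, x2):
    return x2 - 2 * x1 + 6

s1 = [s(x1, x2) for y, x1, x2 in rows if y == 1]
s2 = [s(x1, x2) for y, x1, x2 in rows if y == 2]
print(s1)                  # matches the slide's S column for Y = 1
print(s2)                  # and for Y = 2
print(min(s1) > max(s2))   # any cutoff in (-8, 10) separates perfectly
```

With such data, a logistic regression on S (or on x1 and x2) has no finite MLE: the slope estimate diverges.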
108. 2019-06-20
Canonical Discriminant Analysis (CDA).
“Canonical” is the statistical term for analyzing latent variables (which are not
directly observed) that represent multiple variables (which are directly
observed) (seen in the Cluster Analysis lecture). CDA is related to PCA and
Canonical Correlation Analysis (CCA).
A canonical variate (CV) is a weighted sum of the variables in the
analysis. CCA is useful in analyzing the strength of association
between two constructs of continuous variables.
CDA finds linear functions of the variables that maximally separate the
means of two or more groups of observations (given by a nominal target
variable), while keeping within-group variation as small as possible.
PCA, by contrast, summarizes total variation. # of groups = # of nominal levels.
109. 2019-06-20
Canonical Discriminant Analysis (CDA).
CDA is equivalent to canonical correlation analysis between the
INTERVAL vars and a set of dummy variables coded from the
TARGET. It produces ‘k’ canonical discriminant functions
(CDFs), or canonical variables, with k = min(# groups – 1, #
variables).
CDF_1 yields the maximum between-group variation relative to the
within-group variation ⇒ the greatest degree of group differences.
CDF_2, uncorrelated with CDF_1, captures group differences not
captured by CDF_1.
The earlier clustering examples showed 2 CDFs, since the number
of chosen clusters was 3.
111. 2019-06-20
Un-reviewed Additional Classification Methods.
Linear Discriminant Analysis (especially for a multi-categorical
dependent variable), but it requires normality, a strong assumption.
Also Quadratic DA and Flexible DA.
Naïve Bayes (also useful for continuous dependent variables).
Support Vector Machines.
Neural Networks (also useful for continuous dependent variables).
Clustering: reviewed in an earlier presentation; can be used as a
classification tool.
112. 2019-06-20
ID3 and C4, C4.5, PART (briefly mentioned in Tree presentation).
K-Nearest Neighbor
Mixture discriminant analysis
(and review in later chapters):
Classification and Regression Trees (CART)
Bagging CART
Random Forest
Gradient Boosting
114. 2019-06-20
Basic logistic SAS program.

proc logistic data = &data.;
   model &depvar. (event = "&event.") = &outvars. / selection = none outroc = troc;
   score data = &validation. out = &val_out. outroc = valoutroc;
   roc; roccontrast;
run;
115. 2019-06-20
SAS program:
For the ROC curve, obtained from proc logistic, see the outroc data set.
For the AUROC, before the "proc logistic" call use:

ods output Association = assoc_out;

After proc logistic has run:

proc sql noprint;
   select nvalue2 into :auroc from assoc_out where upcase(label2) = "C";
quit;
%put auroc is &auroc.;
117. 2019-06-20
1) Explain, technically or non-technically, separation and
Quasi-separation. Solutions for the problem, or is it not a
problem? Does it apply to other supervised methods? Is it
co-linearity?
2) Interactions in logistic as opposed to in linear
regression? Do some research.
3) Unbalanced samples, problem in logistic? Does it make a
difference whether you have a fixed or a searched model?
4) Accuracy vs. precision?