Ch. 1.3-1
2019-06-20 Leonardo Auslender Copyright 2009
Contents
Present practice in classification methods and interpretation
of probability.
Logistic Regression.
Odds.
Coefficient interpretation and p-values.
Model Performance and Assessment.
Gains Charts, ROC.
Variable Selection.
Issues:
Balanced / unbalanced samples.
Separation.
Canonical Discriminant Analysis
Present Practice, issues and headaches.
Classification is the Data Mining area par excellence. We will focus on
binary targets of events/non-events. Research & applications in:
Clinical data analysis (disease / no-disease, insistence
on odds ratios and logistic regression),
Direct marketing: response/no-response, attrition, etc.
Recommender systems: Interesting / uninteresting.
Fraud, Terrorism, banking
……..
Issues and Headaches (can’t cover them all in this lecture):
-Binary target training/estimation mixture of events/non-events.
-Obfuscating terminology.
-Model comparisons.
-Modeling methodologies confront unexpected issues: co-linearity and separation in
logistic (and neurals?), non-smooth, step response functions in trees, etc.
Present Practice, issues and headaches.
Unclear practices on mixture of 0/1 of target variable for estimation, and
relative cost of misclassification of 0/1.
Obfuscating terminology of ROC, precision, model choice, etc.
Concepts used in classification methods to compare models are derived
from different methodologies (i.e., trees, neurals, etc). Mixture of methods
used in practice.
Unclear practices on separation, co-linearity and variable selection. Co-linearity
is likely more of a bête noire than in the linear regression case, and adds
doubts about the stability of predicted probabilities when scoring future
databases.
All models produce predictions in probability or rank (score) form
⇒ a decision has to be made about a cutoff point (or none).
In next pages: Events = “1”, non-Events = “0”
(usually 1, -1 in engineering and science applications).
Meaning of Probability Statements: context
dependent.
Probability measured in interval (0, 1) + methods are
“mechanical”, i.e., context provided by analyst. E.g:
1. Model estimates a household has a 70% probability of responding to a credit card solicitation.
Solicitation cost is minimal, and the bad feeling of a non-responding customer who didn't want to be
solicited is disregarded. Likely action: solicit.
2. Model estimates that probability that conference ceiling will fall on us right now is 40%. How
many of you will stay until I finish reading this paragraph? Action: Run for your life? Sue the
presenter?
3. DNA matching asserts that probability ( male A is father of baby) = 95%, i.e., 1 in 20 is False
Positive. Action: A is father?
4. The probability of a devastating earthquake in NYC in the next 24 hours is negligible but still
non-zero. Nobody seems to care for this element of probability, why? Would the same apply in
LA with higher probability?
 Cost (profit) of implementing/not implementing decision, even if not exactly
quantifiable, is most important in context. CONTEXT, CONTEXT, CONTEXT.
Q.: Give examples of 0-1 cases of interest.
Binary Dependent Variable – Some Definitions.
1) Standard linear model Y* = Xβ + ε, but now Y is not continuous:

Y = 1 if Y* > k (Y* continuous), 0 otherwise. (E.g., 'good' student if GPA > 3.)

Usual assumptions hold, and k is usually unknown. Ergo, estimate Y*?
Assume k = 0 below for ease of exposition.

Let π = Pr(Y = 1 | X = x) = Pr(Xβ + ε > 0) = Pr(ε > −Xβ) = 1 − F(−Xβ), F the CDF of ε.

Since E(ε) = 0 ⇒ E(Y*) = Xβ, expanded in terms of Y as:
E(Y) = Σ πᵢ yᵢ = π(1) + (1 − π)(0) = π ⇒ π = Xβ.

Problem: if Y is 0/1 ⇒ ε can take only two values, and therefore is
not normal (do we care?).
Binary Dependent Variable – Some Definitions to drop linear model.
If Yᵢ = 1 ⇒ εᵢ = 1 − E(Yᵢ) = 1 − Xᵢβ = 1 − πᵢ, which occurs with prob. πᵢ;
if Yᵢ = 0 ⇒ εᵢ = −πᵢ, which occurs with probability (1 − πᵢ): the error is not normally but
binomially distributed, and its variance is not constant (if π = .5 ⇒ maximum
variance = .25; if π = .01 (in tails) ⇒ variance = .0099; as π → 1, variance
→ 0) ⇒ any predictor affecting the mean also affects the variance ⇒ the usual linear
model is not applicable because it assumes constant variance.
Obviously, the estimated π in Y* above may lie outside [0, 1] ⇒ estimated variance < 0. One
could still rescale predicted values to lie within 0-1 (Foster & Stine, 2004), but
how to compare two models that predict at different values > 1? (Rescaling to
(0, 1) is quite ad hoc.)
More serious reason to drop the linear model:
Linear marginal effects: the linear model ⇒ the effect of ΔX on prob(Y) is constant regardless
of the initial value of prob(Y). E.g., ΔX = 1 ⇒ Δprob(Y) = .1 whether prob(Y) =
.5 or prob(Y) = .93. Intuitively, Δprob(Y) should be larger for prob(Y) closer
to 0.5.
Contrived example to motivate marginal effects.
Suppose urn with 50 Red balls and 50 Blue balls. The exercise is to determine
percentage of Red balls in urn, and ‘effort’ required to increase probability of
obtaining a red ball.
"Effort" (i.e., the X variable) is the number of additional Red balls to add to increase the
probability. Starting from the 50/50 mixture, X = 0 (i.e., not adding any red ball yet)
just gives 50%. To reach 51%, 52%, …, solve for r in the equation below (b
remains at 50; r = # reds, b = # blues).
Let's see how to raise the probability from .5 to .95 in jumps of .05 by raising r:
prob = r / (r + b)  ⇒  r = b · prob / (1 − prob)

Obs    b    prob        r
 1    50    0.50    50.000
 2    50    0.55    61.111
 3    50    0.60    75.000
 4    50    0.65    92.857
 5    50    0.70   116.667
 6    50    0.75   150.000
 7    50    0.80   200.000
 8    50    0.85   283.333
 9    50    0.90   450.000
10    50    0.95   950.000

Δprob / ΔX is not constant, unlike in linear regression: it takes an
increasing # of red balls to increase prob.
Prob rises from .75 to .80 as reds rise from 150 to 200, but going from .90 to .95
requires raising the number of red balls from 450 to 950.
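The urn arithmetic can be checked with a short sketch (plain Python; the function name is ours):

```python
# Sketch of the urn example: with b blue balls fixed, the number of red
# balls needed to reach a target probability is r = b * prob / (1 - prob).
def reds_needed(prob, b=50):
    return b * prob / (1 - prob)

for prob in (0.50, 0.55, 0.75, 0.90, 0.95):
    print(prob, round(reds_needed(prob), 3))
```

This reproduces the r column of the table above, including the jump from 450 to 950 reds between prob = .90 and .95.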
Binary Dependent Variable – Some Definitions.
For ordinal or nominal dependent variables, coding is arbitrary, and any
transformation gives different results. ⇒ transform π.
Need a non-decreasing function of X onto the unit interval. The function usually
chosen is the CDF of the unit-normal distribution ⇒ PROBIT method, or the
standardized logistic distribution ⇒ linear logistic or logit model (more
convenient, no integrals, similar to a t-distribution with 7 dfs). Non-symmetric
choices: log-log and complementary log-log (cloglog, used in survival
and interval-censored models).
NB: Purpose is to model probability of occurrence of event.
Different link functions (next slide, with β = 3):

Logistic: P1 = 1 / (1 + exp(−3X))
Probit: P1 = Inv_norm(3X · π/√3)
Linear: P1 = .5 + X/3
Complementary log-log (extreme value, Gompit; note its skewness): P1 = 1 − exp(−exp(3X))

Not shown: log-log: −log(−log(π)) (seldom used because of inappropriate behavior for π < .5).
For which link is "best" see Koenker and Yoon (2009, Journal of Econometrics).
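A minimal sketch of three of these response functions in their unscaled form (pure Python; the normal CDF is built from the error function):

```python
import math

def logistic(z):
    # inverse logit link
    return 1 / (1 + math.exp(-z))

def probit(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def cloglog(z):
    # complementary log-log: asymmetric, approaches 1 faster than 0
    return 1 - math.exp(-math.exp(z))
```

Note that logistic(0) = probit(0) = 0.5 while cloglog(0) ≈ 0.632, which illustrates the cloglog asymmetry mentioned below.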
Binary Dependent Variable – Comparison of approaches.
Cloglog approaches 1 faster.
Binary Dependent Variable, odds and log odds.
Logistic is easier than probit mathematically.
Note that the logit (the 'linear equation' Xβ) estimates log-odds, not
probability. Log-odds are popular in gambling: e.g., odds of 19 to 1 that the 'house'
will win (equivalent to a 95% probability of winning for the house).
π: a nonlinear function of β, requires an iterative estimation method.
NB: Will skip individual parameter inference.
π = e^(Xβ) / (1 + e^(Xβ)) = [1 + e^(−Xβ)]⁻¹, the logistic CDF.

Interpretable as log-odds:

Odds = π / (1 − π) = e^(Xβ)

log(π / (1 − π)) = Xβ, the log-odds or logit.
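The probability ↔ odds ↔ logit correspondence, including the gambling example, in a small sketch:

```python
import math

def odds(p):
    # odds in favor of the event
    return p / (1 - p)

def logit(p):
    # log-odds
    return math.log(odds(p))

def inv_logit(z):
    # logistic CDF: maps the linear predictor back to a probability
    return 1 / (1 + math.exp(-z))

# 95% win probability for the house <=> odds of 19 to 1
print(round(odds(0.95), 6))
```

Round-tripping inv_logit(logit(p)) recovers p, which is why the logit is a valid link.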
(Plot: X simulated from a scaled uniform; Y = 1 / (1 + exp(−1.5·X)).)
Example of gambling odds. What is the meaning of 1 / p?
Especially in the case of rare events, that is, when the probability of an event is
low, the reciprocal of the probability, 1/p, provides a '1 in n' re-scale that
can be informative. For instance, if prob(event) ≈ 0.0018, its reciprocal is
about 1 in 552 tries. If the probability distribution is geometric, you would
need on average about 552 flips of the coin to obtain the desired event.
We can relate this topic to odds, which will be useful later on. If odds of winning
a bet are 1.25, and you bet 8, then you get 8 * 1.25 = 10 back when you win
(including the original bet, for a total profit of 2), and nothing in the case of
a loss. The bet would be fair if the probability of winning was 8/ 10 = 0.8,
the reciprocal of which is 1.25.
In the context of survey sampling, the reciprocal of the probability of being
included is called the sampling weight.
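The fair-bet arithmetic of the example, sketched directly:

```python
# A bet at decimal odds d returns stake * d on a win (stake included);
# the bet is fair when the win probability equals 1 / d.
stake, d = 8, 1.25
payout = stake * d          # total returned on a win, stake included
profit = payout - stake     # net gain on a win
fair_p = 1 / d              # win probability that makes the bet fair

print(payout, profit, fair_p)
```

With the numbers above: payout 10, profit 2, and the bet is fair at probability 0.8, whose reciprocal is the quoted odds of 1.25.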
HW: What is probability?
Example: fraud as the dependent (target) variable.
Fraud Data

Basic information on the original data sets:
  Data set name ............. train
  Num observations .......... 3595
  Validation data set ....... validata
  Num observations .......... 2365
  Test data set ............. (none)
  Num observations .......... 0
  Dep variable .............. fraud
  Pct Event Prior TRN ....... 20.389
  Pct Event Prior VAL ....... 19.281
  Pct Event Prior TEST ...... (n/a)
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 2320.999 1968.388
SC 2326.767 2008.768
-2 Log L 2318.999 1954.388
R-Square 0.1429 Max-rescaled R-Square 0.2286
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 364.6107 6 <.0001
Score 400.9366 6 <.0001
Wald 268.8171 6 <.0001
Typical output for logistic regression, similar to linear.
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
INTERCEPT 1 -0.3390 0.2145 2.4980 0.1140
DOCTOR_VISITS 1 -0.00609 0.00820 0.5517 0.4576
MEMBER_DURATION 1 -0.00657 0.000807 66.4090 <.0001
OPTOM_PRESC 1 0.1429 0.0307 21.5918 <.0001
TOTAL_SPEND 1 -0.00002 6.015E-6 8.1934 0.0042
NO_CLAIMS 1 0.7634 0.0550 192.3966 <.0001
NUM_MEMBERS 1 -0.1089 0.0596 3.3309 0.0680
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
DOCTOR_VISITS 0.994 0.978 1.010
MEMBER_DURATION 0.993 0.992 0.995
OPTOM_PRESC 1.154 1.086 1.225
TOTAL_SPEND 1.000 1.000 1.000
NO_CLAIMS 2.146 1.926 2.390
NUM_MEMBERS 0.897 0.798 1.008
Coefficient Interpretation.
Coefficient interpretation: Odds, Odds-ratios (non-interactive model).

Estimated logit: log(π̂ / (1 − π̂)) = a + bX1 + cX2
Estimated odds = e^(a + bX1 + cX2)

Odds(Y | X1 = x1) = e^(a + b·x1 + cX2)
Odds(Y | X1 = x1 + 1) = e^(a + b·(x1 + 1) + cX2)

Odds Ratio(X1) = e^(a + b(x1+1) + cX2) / e^(a + b·x1 + cX2) = e^b, for X2 at any value.

log(Odds Ratio) = b when X1 rises by 1 and X2 is held fixed.

If b = .04, e^b = 1.0408 ⇒ Odds(Y) rise by 4.08% when X1 rises by 1
(in general, by (e^b − 1)·100%).
Coeff. interpretation – dummy var – non-interaction model.

Y ∈ {0,1}, X ∈ {0,1}, logit(Y) = α + βX + ε

Pr(Y=1 | X=1) = e^(α+β) / (1 + e^(α+β)) … finding the odds:

Odds at X=1: Pr(Y=1|X=1) / Pr(Y=0|X=1) = Pr(Y=1|X=1) / (1 − Pr(Y=1|X=1)) = e^(α+β)
Odds at X=0: Pr(Y=1|X=0) / Pr(Y=0|X=0) = e^α

Odds Ratio: ratio of odds = e^(α+β) / e^α = e^β:
β is the change in log odds due to a one-unit change of X.

More generally, if X ∈ {r, s}: odds at X = r are e^(α+rβ) (similarly for X = s),
and the odds ratio of an increase of X from r to s is {e^β}^(s−r).
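A quick numeric check of this odds-ratio algebra (α and β here are hypothetical values, not estimates from the fraud model):

```python
import math

alpha, beta = -1.0, 0.7          # hypothetical logit coefficients

def odds_at(x):
    # odds of Y=1 at X=x under logit(Y) = alpha + beta*x
    return math.exp(alpha + beta * x)

or_unit = odds_at(1) / odds_at(0)     # one-unit odds ratio = e^beta
or_2_to_5 = odds_at(5) / odds_at(2)   # increase from r=2 to s=5: (e^beta)^(5-2)
```

The intercept α cancels in both ratios, which is why the odds ratio depends only on β and the size of the change in X.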
Coefficient interpretation: Odds-ratios (interactive model).

Estimated logit: log(π̂ / (1 − π̂)) = a + bX1 + cX2 + dX1X2
Estimated odds = e^(a + bX1 + cX2 + dX1X2)

Odds(Y | X1 = x1) = e^(a + b·x1 + cX2 + d·x1·X2)
Odds(Y | X1 = x1 + 1) = e^(a + b(x1+1) + cX2 + d(x1+1)X2)

Odds Ratio(X1) = e^(a + b(x1+1) + cX2 + d(x1+1)X2) / e^(a + b·x1 + cX2 + d·x1·X2) = e^(b + dX2)

⇒ e^b is NOT the odds ratio of X1 except when X2 = 0;
the odds ratio is conditional on X2 values.

NB: If X2 is measured in deviation terms, say from its mean, then e^b is the
odds-ratio at X2 = mean.
Coefficient interpretation: Odds-ratios (interactive model, cont.).

Estimated logit: log(π̂ / (1 − π̂)) = a + bX1 + cX2 + dX1X2

Odds Ratio(X1) = e^(a + b(x1+1) + cX2 + d(x1+1)X2) / e^(a + b·x1 + cX2 + d·x1·X2) = e^(b + dX2)

Odds Ratio(X2) = e^(a + bX1 + c(x2+1) + dX1(x2+1)) / e^(a + bX1 + c·x2 + dX1·x2) = e^(c + dX1)

Odds Ratio(X1, X2 both raised by 1) = e^(b + c + d(X1 + X2 + 1))

⇒ e^d = OR(X1, X2) / (OR(X1) · OR(X2)).
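With an interaction term, the same numeric check shows that the odds ratio of X1 depends on the level of X2 (coefficients are hypothetical):

```python
import math

a, b, c, d = -1.0, 0.4, 0.3, 0.2     # hypothetical; d is the X1*X2 interaction

def odds(x1, x2):
    return math.exp(a + b * x1 + c * x2 + d * x1 * x2)

def or_x1(x1, x2):
    # odds ratio for a one-unit rise of X1 with X2 fixed: e^(b + d*x2)
    return odds(x1 + 1, x2) / odds(x1, x2)
```

or_x1 equals e^b only at X2 = 0; at any other X2 it is e^(b + d·X2), illustrating why a single reported odds ratio is misleading for interactive models.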
Analysis of Maximum Likelihood Estimates (repeated)
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
INTERCEPT 1 -0.3390 0.2145 2.4980 0.1140
DOCTOR_VISITS 1 -0.00609 0.00820 0.5517 0.4576
MEMBER_DURATION 1 -0.00657 0.000807 66.4090 <.0001
OPTOM_PRESC 1 0.1429 0.0307 21.5918 <.0001
TOTAL_SPEND 1 -0.00002 6.015E-6 8.1934 0.0042
NO_CLAIMS 1 0.7634 0.0550 192.3966 <.0001
NUM_MEMBERS 1 -0.1089 0.0596 3.3309 0.0680
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
DOCTOR_VISITS 0.994 0.978 1.010
MEMBER_DURATION 0.993 0.992 0.995
OPTOM_PRESC 1.154 1.086 1.225
TOTAL_SPEND 1.000 1.000 1.000
NO_CLAIMS 2.146 1.926 2.390
NUM_MEMBERS 0.897 0.798 1.008
Logistic regr. Output for coefficient evaluation.
Interpretation.
Intercept: exp(−0.3390) = 0.71 is the baseline odds, the ratio of the probability
of fraud to non-fraud when all predictors are zero, i.e., no doctor visits, no
prescriptions, zero member duration, etc.
Member Duration: the odds-ratio point estimate 0.993 is less than 1, meaning
the odds of fraud versus non-fraud fall (by about 0.7%) with each additional
unit of member duration. The odds ratio is constant as long as there is no
interaction term.
Number of claims: the odds ratio of 2.146 means that the odds of fraud versus
non-fraud more than double with each additional claim, as long as there is no
interaction term.
Etc.
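The printed odds-ratio column is just exp() applied to the coefficient column, as a quick check against the output above shows:

```python
import math

# coefficients from the maximum likelihood estimates table above
coefs = {
    "MEMBER_DURATION": -0.00657,
    "OPTOM_PRESC": 0.1429,
    "NO_CLAIMS": 0.7634,
    "NUM_MEMBERS": -0.1089,
}
odds_ratios = {name: round(math.exp(b), 3) for name, b in coefs.items()}
# reproduces the reported column: 0.993, 1.154, 2.146, 0.897
print(odds_ratios)
```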
Notice: validation results added.
Binary Dependent Variable – Log-likelihood. ***
Since observations are assumed independent, the joint probability is (take
logs because logs are easier to work with):

Pr(Y1, …, Yn) = Π πᵢ^Yᵢ (1 − πᵢ)^(1−Yᵢ)

Taking logs, and remembering that the likelihood is f(data; π), the
log-likelihood function (omitting the "i" subscript) is:

ln L(π; data) = Σ [ Y ln π + (1 − Y) ln(1 − π) ]
             = Σ [ Y ln(π / (1 − π)) + ln(1 − π) ]
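The log-likelihood in the last line above, as a direct sketch:

```python
import math

def bernoulli_loglik(y, p):
    # sum of Y*ln(pi) + (1 - Y)*ln(1 - pi) over observations
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))
```

Maximizing this over β, with each πᵢ the logistic function of Xᵢβ, is exactly what the iterative estimation does; note the value rises as predictions move toward the observed outcomes.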
Binary Dependent Variable Model evaluation via Inference.
Nested Approach:
Does a model with added predictor(s) provide "significantly" more information
about the dependent variable than the model without them? Typically, H0:
constrained model, H1: fuller model (NB: no notion of model fit as in the R² of regression).
Compare two nested models:
Log(odds) = α + β1x1 + β2x2 + β3x3 + β4x4 (model 1) H1
Log(odds) = α + β1x1 + β2x2 (model 2) H0
Three tests:
– Likelihood ratio test (LRT): uses the fitted likelihoods at both H0 and H1.
– Wald test: starts at the H1 (fuller) fit and asks whether the constraints of H0 are compatible with it.
– Score or Lagrange Multiplier (LM) test: starts at H0 and asks whether
movement towards H1 is an improvement. The 1st derivative of the likelihood
function is called the score function.
Model χ2 and Model Comparisons.
Likelihood-ratio test: used to contrast two models, one of which is a subset of the
other (i.e., some βj's set to zero; called nested-model inference).
G0 = 2 (log L1 − log L0) ~ χ²(k), k = # coefficients set to 0 (called the
Deviance, and sometimes residual deviance).
G0 is called −2 LLR or 2 LLR (depending on the order of the likelihoods).
A bit of confusion: −2 LLR is also a deviance, but specific to the model vs. the
saturated model, and the saturated model has log-likelihood 0.
Is likelihood the same as probability?
In the case of probability, we know the pdf (probability distribution function) and parameter values
and want the probability of specific event(s). I.e., we know the betas.
In the case of likelihood, given specific event(s), we estimate the pdf and/or model parameters. That's
why, given the data, we want the parameter estimates that maximize the probability of the event(s)
reflected by the data, for an assumed pdf. We don't know but estimate the betas.
Score (LM) Test.
Let U(β) = ∂ ln L / ∂β: the gradient (score) vector.
Let H(β) = ∂² ln L / ∂β∂β′: the Hessian matrix.
Let I(β) = −H(β), or the expected value of −H(β).
Let β̂0 be the MLE under H0.
⇒ U′(β̂0) I(β̂0)⁻¹ U(β̂0) ~ χ²(r)
Example, Y ~ B (n, theta) ***
H0: theta = 0.5, H1: theta = 0.2. Wald is based on the distance (0.5 − 0.2), the LRT
on the vertical distance between log-likelihoods, and the LM on the slope of the
log-likelihood at theta = 0.2. When the log-likelihood is a smooth curve, all
three tests yield asymptotically the same answer.

(Plot: log-likelihood vs. theta over [0, 1]; the score test is the slope at
theta = 0.2, half the deviance is the vertical distance, and the Wald statistic
is the horizontal distance.)
Non-nested Model Comparisons:
Akaike Information Criterion (AIC) = Deviance (= −2 log L) + 2k,
k = # model parameters.
BIC (Bayesian information criterion) or SC = Deviance + k log n.
In both cases, smaller ⇒ better model.
NOTE: in non-parametric models, such as tree-based models, 'k'
is not fully defined ⇒ no fully accepted AIC or BIC.
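These formulas can be checked against the fit-statistics slide; n = 2365 below is an assumption inferred from the reported SC, not stated on that slide:

```python
import math

def aic(deviance, k):
    # Akaike information criterion: deviance plus 2 per parameter
    return deviance + 2 * k

def sc(deviance, k, n):
    # Schwarz criterion / BIC: deviance plus log(n) per parameter
    return deviance + k * math.log(n)

# -2 Log L = 1954.388 with k = 7 parameters reproduces AIC = 1968.388;
# with n = 2365 (assumed), SC comes out at about 2008.77
print(aic(1954.388, 7), round(sc(1954.388, 7, 2365), 3))
```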
Somers' D and Pairs Concordance.
P = # possible pairs of observations with different values of Y = n0 · n1.
% Concordant: % of pairs where the predicted Prob(Y = 1) of the event observation
exceeds that of the non-event observation: nc / P.
% Discordant: % of pairs where it is lower: nd / P.
% Tied: % of pairs neither concordant nor discordant: nt / P.
−1 (all pairs discordant) ≤ Somers' D ≤ 1 (all pairs concordant);
Somers' D = (nc − nd) / (nc + nd + nt).
Gamma = (nc − nd) / (nc + nd)
Tau-a = (nc − nd) / (# of all pairs of observations)
c = .5 (1 + Somers' D), more popularly called the ROC area (reviewed below).
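Pair concordance, Somers' D, and c computed directly on a toy sample, by brute force over all (event, non-event) pairs:

```python
def concordance(y, p):
    # compare every (event, non-event) pair of predicted probabilities
    ones = [pi for yi, pi in zip(y, p) if yi == 1]
    zeros = [pi for yi, pi in zip(y, p) if yi == 0]
    nc = sum(p1 > p0 for p1 in ones for p0 in zeros)   # concordant pairs
    nd = sum(p1 < p0 for p1 in ones for p0 in zeros)   # discordant pairs
    P = len(ones) * len(zeros)                         # = nc + nd + nt
    somers_d = (nc - nd) / P
    c = 0.5 * (1 + somers_d)
    return nc, nd, P, somers_d, c
```

For y = [1, 1, 0, 0] with predictions [0.9, 0.4, 0.6, 0.1] there are 4 pairs, 3 concordant and 1 discordant, so Somers' D = 0.5 and c = 0.75.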
Hosmer-Lemeshow fit (2000, p. 148).
Posterior probabilities in logistic regression are calculated from
covariate patterns. In data mining, most patterns have very few
observations, and many potential patterns are empty.
Assume 5 binary predictors in the model ⇒ maximum number
of patterns = 2^5 = 32. Assume only J = 8 patterns exist in the
data ⇒ expected values will be small or 0 in most cells. In data
mining, typically J ≈ n. With continuous predictors, the number of
patterns is much larger.
Hosmer-Lemeshow "decile of risk" fit (2000, p. 148).
Proposal: "create patterns" by grouping by percentiles of the posterior
predicted distribution. By simulation, if g = 10 percentiles (deciles) are
chosen, the statistic is distributed as chi-square(g − 2) when J = n. Use
Pearson's χ² test; small p-values indicate lack of fit.
Problem: the Hosmer-Lemeshow test is known to fail with
continuous covariates due to the large number of potential
patterns.
Besides, it is known to produce misleading results due to ties or the order of
observations (Bertolini et al., 2000). The test is considered obsolete (Hosmer et al.,
1997).
DO NOT USE IT.
Cox and Snell (1989): R² = 1 − [ L(0) / L(β̂) ]^(2/n), where L(0) is the likelihood
of the intercept-only model and L(β̂) that of the fitted model.

Nagelkerke (1991): Max-rescaled R² = R² / Max R², Max R² = 1 − [L(0)]^(2/n).

(Both obtained via the RSQUARE option in SAS Proc Logistic, called Rsquare and
Max-rescaled respectively.)

Brier score: (1/n) Σᵢ (π̂ᵢ − Yᵢ)².
What would statisticians do without R2? Create more …
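These measures reproduce the earlier output, using the log-likelihoods from the fit-statistics slide; n = 2365 is an assumption inferred from the printout:

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    # 1 - [L(0)/L(beta-hat)]^(2/n), on the log scale
    return 1 - math.exp((2 / n) * (ll_null - ll_model))

def nagelkerke_r2(ll_null, ll_model, n):
    # rescale by the maximum attainable Cox-Snell R-square
    max_r2 = 1 - math.exp((2 / n) * ll_null)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2

def brier(y, p):
    # mean squared distance between outcomes and predicted probabilities
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

# from the output: -2LogL = 2318.999 (intercept only), 1954.388 (model)
ll0, llm = -2318.999 / 2, -1954.388 / 2
print(round(cox_snell_r2(ll0, llm, 2365), 4),   # ~0.1429
      round(nagelkerke_r2(ll0, llm, 2365), 4))  # ~0.2286
```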
(Annotated model-fit output, regions marked (1)–(6).)
(1) Significant LRT: at least one beta is significantly different
from 0. Analogous to the F-test in linear regression.
Agreement with (2) and (3); if no agreement, the model is in
doubt.
(2) (4) and (5) confirm that the intercept-only model is inferior.
(3) In (6), the difference between the intercept-only model and the
intercept-and-covariates model yields (1).
Association of Predicted Probabilities and Observed Responses
Percent Concordant 75.4 Somers' D 0.512
Percent Discordant 24.2 Gamma 0.515
Percent Tied 0.4 Tau-a 0.160
Pairs 870504 c 0.756
Partition for the Hosmer and Lemeshow Test
FRAUD = 1 FRAUD = 0
Group Total Observed Expected Observed Expected
1 237 13 10.71 224 226.29
2 237 18 17.08 219 219.92
3 237 15 21.88 222 215.12
4 237 25 26.73 212 210.27
5 237 34 31.53 203 205.47
6 237 36 37.13 201 199.87
7 237 39 43.19 198 193.81
8 237 53 52.97 184 184.03
9 237 84 71.69 153 165.31
10 232 139 143.09 93 88.91
Concordance Measures.

MEASURE               Cloglog     Logit    Probit
Gamma                   0.545     0.565     0.565
Pairs                138659.0  138659.0  138659.0
Percent Concordant     71.400    77.128    77.117
Percent Discordant     21.033    21.430    21.415
Percent Tied            7.567     1.442     1.468
Somers' D               0.504     0.557     0.557
Tau-a                   0.245     0.271     0.271
c                       0.752     0.778     0.779

Example: Logit and Probit almost identical.
Misclassification Rates (typically assumes 0.5 cutoff)
Confusion Table.

0: "Negative" (non-event); 1: "Positive" (event).
TN: True Negative; FN: False Negative; FP: False Positive; TP: True Positive.
"True", "False" refer to the actual state of nature; "Negative" and
"Positive" refer to the prediction.

Actual \ Predicted      0     1
0 (non-event)          TN    FP
1 (event)              FN    TP
Misclassification Rates, K-S, ROC (cont. 1).
1) Classification accuracy or overall classification rate (1 −
classification error rate, the converse of the misclassification rate), plus "0" and
"1" classification rates:
Overall: (TP + TN) / (TP + TN + FP + FN).
Assumes a known and unchanging "natural" class distribution and that the error
cost of an FP equals that of an FN. Typically favors the majority class; but in most
applications, the cost of misclassifying a "1" is higher.
"0" classification rate: TN / (TN + FP)
"1" classification rate: TP / (TP + FN)
Misleading results: assume "1" is important. Overall (left model) = 92.5%,
overall (right model) = 97.5%, but the right model misses all "1"s.

           Left model             Right model
           Pred. 0   Pred. 1      Pred. 0   Pred. 1
Actual 0      180        15           195         0
Actual 1        0         5             5         0
Misclassification Rates, K-S, ROC
2) Event ("1"), non-event ("0") and overall precisions:
PR(1) = TP / (TP + FP) — event precision
PR(0) = TN / (TN + FN) — non-event precision
PR = (PR(1) + PR(0)) / 2 (theoretical max for each PR(i) = 1).
(Of those predicted as "1", the proportion of "true" ones …)
3) Area under the Receiver Operating Characteristic curve
(AUROC): graph of FP rate (x-axis) vs. TP rate (y-axis) as the classification
threshold varies. The Laplace estimate is used (in tree algorithms)
because it yields more consistent improvements in ROC curves.
TP rate = TP / (TP + FN) (sensitivity)
FP rate = FP / (FP + TN) (1 − specificity)
TN / (FP + TN) = specificity
Recall, F1 …
Recall / Sensitivity / TPR: TP / (TP + FN). Percentage of
events correctly classified.
F1-score / F-measure / F-score: harmonic mean of precision
and recall. Requires that R and P not move opposite to each
other. Useful when the costs of misclassifying "0" and "1" are
very different. If costs are similar and counts of FN ≈ FP,
accuracy can be used. The higher the F1, the better. A weighted
F1-score can also be used to give more importance to P or to R.

F1 = 2 · (Recall · Precision) / (Recall + Precision)

Specificity / TNR: TN / (TN + FP)
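All of these rates from a single confusion table, applied to the 'left model' of the earlier misleading-accuracy example:

```python
def rates(tp, fp, tn, fn):
    # standard confusion-table rates; guards avoid division by zero
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity / TPR
    specificity = tn / (tn + fp) if tn + fp else 0.0     # TNR
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, specificity, f1

# left model above: TN=180, FP=15, FN=0, TP=5
acc, prec, rec, spec, f1 = rates(tp=5, fp=15, tn=180, fn=0)
```

Accuracy is a healthy 92.5% while precision is only 0.25, illustrating why accuracy alone misleads with unbalanced classes.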
Which measure to use?
If costs are similar, counts of FN and FP are similar, and Recall
does not move opposite Precision, use F1; also when the event
prior is much smaller.
Choose Recall if an FN is more costly than an FP, e.g.,
misdiagnosing a cancer patient as healthy.
Choose Precision when false positives are costly, e.g., spam:
prefer to miss a spam message rather than send a good message
to the spam folder.
Choose Specificity to concentrate on non-events and avoid
false alarms (FP), e.g., a trial verdict.
Validation rates ('−' ⇒ misclassification & mis-precision). Fraudulent activity yes/no:

Fraud    Model                     Pred. 0            Pred. 1          Overall
y/n                              Class    Prec      Class    Prec     Class Rate
0        M1_TRN_LOGISTIC_NONE    76.87   88.53     -23.13  -59.64        76.87
0        M1_VAL_LOGISTIC_NONE    78.31   88.62     -21.69  -61.06        78.31
1        M1_TRN_LOGISTIC_NONE   -38.88  -11.47      61.12   40.36        61.12
1        M1_VAL_LOGISTIC_NONE   -42.11  -11.38      57.89   38.94        57.89
Overall  M1_TRN_LOGISTIC_NONE            88.53              40.36        73.66
Overall  M1_VAL_LOGISTIC_NONE            88.62              38.94        74.38
The model classifies correctly 76.87% of non-frauds as non-frauds
and misclassifies 23.13%, with a probability cutoff point of 50%.
Misclassification Rates, K-S, ROC
Heuristically, AUROC = the proportion of (0, 1) pairs of observations such that
the posterior probability of the '1' exceeds that of the '0'. Max value is 1.
Classification methods typically maximize classification rates ⇒
a tendency to focus just on these rates, which does not necessarily
provide a good balance between events and non-events.
ROC curves can cross, e.g., for data sets with different proportions of
0/1. The NW direction ⇒ better model.
B is preferred to C for low FP rates, but C is preferred over B later
on. A is clearly inferior.
"Best" model: max AUROC. But max AUROC may not be 'best' for a
specific cost and class distribution. Plus, it is based on
classification and not precision measures (detailed below).
Misclassification Rates, K-S, ROC
ROC:
1) a step function built by varying the classification threshold.
Different algorithms exist to 'smooth' the curves.
2) invariant to any monotonic transformation of the posterior
probabilities, because only rankings matter.
AUROC comparison: Gradient Boosting better than Logistic for TRN and VAL.
Precision (P), Recall (R), and P & R vs. cutoff.
Ideally, the curve reaches the NE corner for the better model, and Gradient
Boosting arches NE more than Logistic; this is equivalent to building the PR-AUC
and comparing areas (not shown). GB better than LG. Only TRN shown.
Standard recommendation
For balanced data (i.e., prior about 50% and equal costs),
choose the most outward ROC curve as the model of choice.
Otherwise, use the Precision-Recall curve. Notice that non-events
mostly affect the ROC curve.
NOTE: we included Gradient Boosting (reviewed in a later
chapter) to emphasize model comparisons.
Explanation via Example: Experian – White Paper.
Lift or gains chart for a credit promotion. The table can be created by deciles (10%
intervals) or vingtiles (5% intervals) of the posterior probability of being "good", in
descending order of probability. More often, binned by equal-sized bins of observations in
descending order of probability. Max potential lift is 1 / event prior percentage.
When comparing different models by deciles, note that the underlying probabilities can
be different across models (e.g., logistic vs. Gradient Boosting).
Cum Good % (Bads) measures the cumulative % of all Goods (Bads) captured up to the
corresponding vingtile. % Cum Diff is the corresponding difference between goods and
bads, and its largest value is the value of the K-S test.
Rule of thumb: a good model should capture at least 70% of the goods by the 5th decile /
10th vingtile.
Lift: ratio of % events in a decile/vingtile to the overall event rate (# responders in
decile/vingtile / total # responders). Cum lift: the corresponding cumulative version.
K-S (Kolmogorov-Smirnov): probability point at which [Cum % captured events − non-
events] is highest. Frequently used in finance.
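The steps above (rank by probability, bin, cumulate, lift, K-S) can be sketched as follows; names and binning details are ours:

```python
def gains_table(y, p, bins=10):
    # rank by descending predicted probability, then cut into near-equal bins
    rows = sorted(zip(p, y), reverse=True)
    n, n1 = len(rows), sum(y)
    n0 = n - n1
    edges = [round(n * (b + 1) / bins) for b in range(bins)]
    table, cum_1, start = [], 0, 0
    for b, end in enumerate(edges):
        cum_1 += sum(yi for _, yi in rows[start:end])   # cumulative events
        start = end
        cum_0 = end - cum_1                             # cumulative non-events
        cum_lift = (cum_1 / n1) / (end / n)             # cum % captured / cum % of base
        ks = abs(cum_1 / n1 - cum_0 / n0)               # K-S separation at this cut
        table.append((b + 1, cum_lift, ks))
    return table
```

For a toy sample with a 20% prior where the model ranks both events on top, the first decile's cumulative lift is 5.0 (the 1 / prior maximum) and K-S peaks at 1.0 once all events are captured.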
Example: Experian – White Paper – Gains Table for goods vs
bads (data set not available).
Gains Table

Pctl  Model                  Min Prob  Max Prob  % Event  Cum %Ev  %Capt  Cum %Capt  Lift  Cum Lift
 10   M1_VAL_LOGISTIC_NONE     0.384     0.997    59.92    59.92   31.14     31.14   3.11     3.11
 10   M1_TRN_LOGISTIC_NONE     0.424     1.000    63.61    63.61   31.24     31.24   3.12     3.12
 20   M1_VAL_LOGISTIC_NONE     0.250     0.382    35.59    47.78   18.42     49.56   1.85     2.48
 20   M1_TRN_LOGISTIC_NONE     0.263     0.423    35.38    49.51   17.33     48.57   1.74     2.43
 30   M1_VAL_LOGISTIC_NONE     0.199     0.250    21.94    39.15   11.40     60.96   1.14     2.03
 30   M1_TRN_LOGISTIC_NONE     0.207     0.263    23.06    40.69   11.32     59.89   1.13     2.00
 40   M1_VAL_LOGISTIC_NONE     0.169     0.199    15.68    33.30    8.11     69.08   0.81     1.73
 40   M1_TRN_LOGISTIC_NONE     0.177     0.207    18.38    35.12    9.00     68.89   0.90     1.72
 50   M1_VAL_LOGISTIC_NONE     0.143     0.168    15.19    29.67    7.89     76.97   0.79     1.54
 50   M1_TRN_LOGISTIC_NONE     0.152     0.177    14.17    30.92    6.96     75.85   0.69     1.52
 60   M1_VAL_LOGISTIC_NONE     0.123     0.143    15.25    27.27    7.89     84.87   0.79     1.41
 60   M1_TRN_LOGISTIC_NONE     0.130     0.152    15.60    28.37    7.64     83.49   0.77     1.39
 70   M1_VAL_LOGISTIC_NONE     0.103     0.123     9.70    24.76    5.04     89.91   0.50     1.28
 70   M1_TRN_LOGISTIC_NONE     0.107     0.130    13.33    26.22    6.55     90.04   0.65     1.29
 80   M1_VAL_LOGISTIC_NONE     0.082     0.103     6.36    22.46    3.29     93.20   0.33     1.17
 80   M1_TRN_LOGISTIC_NONE     0.085     0.107    10.03    24.20    4.91     94.95   0.49     1.19
 90   M1_VAL_LOGISTIC_NONE     0.061     0.081     8.02    20.85    4.17     97.37   0.42     1.08
 90   M1_TRN_LOGISTIC_NONE     0.063     0.085     6.39    22.22    3.14     98.09   0.31     1.09
100   M1_VAL_LOGISTIC_NONE     0.004     0.061     5.08    19.28    2.63    100.00   0.26     1.00
100   M1_TRN_LOGISTIC_NONE               0.063     3.90    20.39    1.91    100.00   0.19     1.00
TRN and VAL results very
similar, no overfitting.
Selecting the Cutoff Point (remember 'k' of the first pages?)
Many criteria, but brains are most important.
K-S (Kolmogorov-Smirnov): probability point at which the cumulative distributions of
events and non-events are most separated.
Cumulative lift: probability point for a selected cumulative lift value (arbitrary).
Profit/cost business decision: probability point at which profit is maximized / cost minimized.
Many others.
BUT:
1) Consider the costs of the decision: a mail piece to the wrong person vs. the wrong HIV
treatment?
2) Are the categories under study discrete, or were they derived from a
continuous scale, such as defining default as a 3-month-late payment? If
so, shouldn't the decision take into consideration how late the payment was?
3) If probabilities are too close to the decision point, shouldn't we get more
data to decide?
The lines at 0.3 and 0.207 are different cutoffs. Note that the proportion of
true events is higher than the probability indicated by the logistic at about
the 0.3 point.
Forward selection for logistic regression (Hosmer and Lemeshow, 2000).
Assume 'p' available predictors. At step 0 (no variables yet entered), fit the
"intercept-only" model and denote its log-likelihood by L0. Fit 'p' univariate
logistic regressions, one per predictor, and obtain 'p' log-likelihoods, L1, …, Lp.
Calculate 'p' likelihood-ratio tests, Gj = -2 (L0 - Lj), j = 1, …, p, and obtain the
corresponding p-values pj, such that Pr[χ²(ν) ≥ Gj] = pj, where ν = 1 if Xj is
continuous, and ν = k - 1 if Xj has k categories.
Choose the predictor Xj with the minimum p-value that is below the entry 'alpha'
level, which usually ranges between .15 and .2. Call this predictor Xj0.
In the next step, the logistic regression with Xj0 takes the previous role of the
intercept-only regression, and a new search starts with the p - 1 remaining
candidate predictors.
The search stops when no p-value is below the entry alpha level. While α = 5% is
embedded in many practitioners' activities, it is known that in variable selection
that level is too stringent. It is usually recommended that the "α" level for entry
be in the range from 15% to 20%, or even higher depending on the breadth or
exploratory nature of the study.
Notice that, at least in Hosmer and Lemeshow (2000), the remaining predictors are
not orthogonalized relative to the one just selected (as they are in variable
selection for linear regression, where orthogonalization is obtained by partialing).
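The entry test above is straightforward to compute. A small Python sketch (the course itself uses SAS; closed-form chi-square tails are used here to avoid dependencies), with hypothetical log-likelihood values:

```python
import math

def lrt(L0, Lj, df=1):
    """Likelihood-ratio statistic G = -2*(L0 - Lj) and its p-value
    Pr[chi2(df) > G]. Closed forms cover df = 1 and df = 2 only."""
    G = -2.0 * (L0 - Lj)
    if df == 1:
        p = math.erfc(math.sqrt(G / 2.0))   # chi-square(1) tail
    elif df == 2:
        p = math.exp(-G / 2.0)              # chi-square(2) tail
    else:
        raise ValueError("for general df use a chi-square survival function")
    return G, p

# Hypothetical: intercept-only log-likelihood vs. one-predictor model.
G, p = lrt(L0=-520.0, Lj=-512.0)
```

Here G = 16 and the p-value falls far below any reasonable entry level, so the predictor would enter.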
Stepwise Selection for logistic regression.
1. We proceed in a fashion similar to forward selection in steps 1 and 2, and assume
that we have selected X1 and X2 so far. At this moment, we perform a backward
selection step to check whether we should retain the variables selected earlier. We
accomplish this by fitting the model with just X1, obtaining its log-likelihood, and
comparing it by LRT to the log-likelihood of the full (X1, X2) model.
2. In the general case of k ‘entered’ variables, the candidate to remove is that
which yields the highest LRT p-value when removed. The comparison is done
against the α-to-remove level that marks the threshold level of explanatory power.
If the highest p-value is higher than the α-to-remove, the variable is removed;
otherwise, it stays in the model.
3. The α-to-remove level must be higher than the α-to-enter to prevent cycles of
the same variable entering and leaving the model. Higher α levels, both to enter
and to be removed, allow for more variables to remain in the model.
4. The search continues adding and removing variables until a) either all ‘p’ candidate
variables have been entered; or b) all the variables in the models have p-values to
remove that are less than the α-to-remove level, and those variables not selected
have p-values higher than the α-to-enter level.
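The add/remove loop in steps 1-4 can be sketched as follows. This is an illustrative Python skeleton, not PROC LOGISTIC's implementation: the p_enter/p_remove arguments stand in for fitting the model and computing LRT p-values, and the toy oracle below simply reuses fixed, hypothetical p-values per variable.

```python
def stepwise(candidates, p_enter, p_remove, alpha_enter=0.15, alpha_remove=0.20):
    """Skeleton of stepwise selection: forward entry of the best candidate,
    then a backward check of the worst variable already in the model."""
    model = []
    for _ in range(2 * len(candidates) + 10):   # guard against cycling
        changed = False
        outside = [v for v in candidates if v not in model]
        if outside:
            p, v = min((p_enter(v, model), v) for v in outside)
            if p < alpha_enter:                  # best entrant qualifies
                model.append(v)
                changed = True
        if model:
            p, v = max((p_remove(v, model), v) for v in model)
            if p > alpha_remove:                 # worst variable is dropped
                model.remove(v)
                changed = True
        if not changed:
            break
    return model

# Toy p-value oracle (hypothetical numbers, same for entry and removal).
pvals = {"no_claims": 0.000, "member_duration": 0.000, "optom_presc": 0.000,
         "total_spend": 0.001, "doctor_visits": 0.019, "num_members": 0.75}
selected = stepwise(list(pvals), lambda v, m: pvals[v], lambda v, m: pvals[v])
```

With alpha_enter < alpha_remove (as step 3 requires), an entered variable cannot be removed by the same fixed p-value, which is what prevents cycling.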
Logistic Selection Steps: summary of entry/removal by 3 selection methods
(M1_TRN_LOGISTIC_FORWARD / _STEPWISE / _BACKWARD).

Forward and stepwise selection (effects entered; both methods agree):
Step 1: no_claims        (# in model 1, p = 0.000)
Step 2: member_duration  (# in model 2, p = 0.000)
Step 3: optom_presc      (# in model 3, p = 0.000)
Step 4: total_spend      (# in model 4, p = 0.001)
Step 5: doctor_visits    (# in model 5, p = 0.019)

Backward selection (effect removed):
num_members removed      (# in model 5, p = 0.7501)
2019-06-20 Leonardo Auslender Copyright 2009
2019-06-20 84
2019-06-20 Leonardo Auslender Copyright 2009
2019-06-20 85
doctor_visits: not removed even though insignificant.
GOF ranks by measure, training data. Measures: AUROC, Avg Square Error,
Cum Lift 3rd bin, Cum Resp Rate 3rd bin, Gini, R-square (Cramer-Tjur);
plus unweighted mean and median of the ranks.

Model                         AUROC  ASE  CumLift3  CumResp3  Gini  R2   Mean  Median
01_M1_TRN_LOGISTIC_BACKWARD     1     1      1         1        2    2   1.33   1.00
02_M1_TRN_LOGISTIC_FORWARD      2     2      2         2        3    3   2.33   2.00
03_M1_TRN_LOGISTIC_NONE         3     3      3         3        1    1   2.33   3.00
04_M1_TRN_LOGISTIC_STEPWISE     4     4      4         4        4    4   4.00   4.00

GOF ranks by measure, validation data (same measures):

Model                         AUROC  ASE  CumLift3  CumResp3  Gini  R2   Mean  Median
05_M1_VAL_LOGISTIC_BACKWARD     2     2      2         2        2    2   2.00   2.00
06_M1_VAL_LOGISTIC_FORWARD      3     3      3         3        3    3   3.00   3.00
07_M1_VAL_LOGISTIC_NONE         1     1      1         1        1    1   1.00   1.00
08_M1_VAL_LOGISTIC_STEPWISE     4     4      4         4        4    4   4.00   4.00
The K-S point is used as a cutoff, mostly in finance.
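A sketch of how the K-S cutoff is found (Python, with hypothetical toy scores): it is the score at which the gap between the cumulative capture rates of events and non-events, i.e., TPR - FPR, is largest.

```python
def ks_cutoff(scores, labels):
    """Return (cutoff, ks): the score value maximizing TPR - FPR."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_c, best_ks = None, -1.0
    for c in sorted(set(scores)):
        tpr = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1) / pos
        fpr = sum(1 for s, y in zip(scores, labels) if s >= c and y == 0) / neg
        if tpr - fpr > best_ks:          # keep the first (lowest) maximizer
            best_c, best_ks = c, tpr - fpr
    return best_c, best_ks

# Hypothetical scored sample.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]
cutoff, ks = ks_cutoff(scores, labels)
```

Geometrically, this is the point of greatest vertical distance between the ROC curve and the diagonal.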
Model performances are very similar, given the data set's simplicity.
AUROC is equivalent across models and data roles. It is possible to add
non-logistic techniques to obtain a wider notion of model performance,
as done in the chapters on trees and ensembles.
Balanced / Unbalanced target – Rare Event.
Typical situation: a binary dependent variable with far fewer '1's than
'0's: fraud (less than 1%), extreme diseases, oil spills, wars, decisions
to run for office, defective products, etc. In this case logistic
regression, for instance, underestimates the probability of rare events.
There is also a tendency to create enormous data bases to capture the
'rares'.
Typically, the misclassification cost of '1's is higher than the
misclassification cost of '0's. Since classifiers typically aim at
maximizing accuracy, 98% '0's already gives very good accuracy, but it
can come with very poor accuracy on the '1's. The same holds for ROC.
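The accuracy trap in numbers, on a hypothetical 2%-event sample (Python): the degenerate "always predict 0" classifier scores 98% accuracy while catching no events at all.

```python
# Hypothetical counts: 9,800 non-events and 200 events (2% '1's).
n0, n1 = 9800, 200
# Degenerate classifier that always predicts '0':
tp, fp = 0, 0
tn, fn = n0, n1
accuracy = (tp + tn) / (n0 + n1)   # high overall accuracy
recall_1 = tp / n1                 # accuracy on the '1's alone
```

This is why event-focused measures (recall on '1's, lift, K-S) matter more than raw accuracy for rare events.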
Most used methods to deal with unbalanced samples for most
classification methods:
1) Under-sample '0's,
2) Over-sample '1's (by re-sampling with replacement), or
3) Both.
4) SMOTE: does (3) with special over-sampling of '1's.
5) Use cost functions (usually difficult to establish).
SMOTE: over-sample by creating synthetic observations from near-'1'
clusters (multiply the difference between the chosen cluster values and
the original by a random value between 0 and 1, and add it to the
original value).
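The synthetic-point step just described, sketched in Python (the two points and the dimensionality are hypothetical; SMOTE's k-nearest-neighbor search is omitted):

```python
import random

def smote_point(x, neighbor, rng=random.random):
    """One SMOTE synthetic observation: the original point plus a U(0,1)
    fraction of its difference from a chosen near neighbor."""
    u = rng()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

x = [1.0, 5.0]          # original minority ('1') observation (hypothetical)
neighbor = [3.0, 9.0]   # a near minority neighbor (hypothetical)
synthetic = smote_point(x, neighbor)
# The synthetic point lies on the segment between x and neighbor.
```

Because the new point sits on the segment between two minority observations, SMOTE interpolates rather than duplicates, unlike plain re-sampling with replacement.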
Bing Zhu (Sichuan University), Bart Baesens (KU Leuven) and Seppe
vanden Broucke (KU Leuven) (2017) argue that 50:50 resampling is not
necessary when the focus is only on GOF measures.
Typically, correct classification of “rares” has greater value than that
of “usuals”.
Is estimation affected?
With rare events, most applications yield smaller probabilities than they
should; for Y = 1 cases the probabilities should be larger (e.g., 0.8
instead of 0.6). Since πi(1 - πi) is 0.16 at 0.8 versus 0.24 at 0.6, the
information (the inverse of the variance) is higher at 0.8 than at 0.6,
so additional '1's (which push the probability higher) cause the variance
to drop further; thus '1's bring in 'more' information than '0's
(King, Zeng 2001).
PROS: re-sampling is the prevailing methodology.
CONS: ad-hoc procedure, differing opinions on the re-sampling mixture,
no analysis of effects on coefficients, and information is dropped.
Balanced/Unbalanced target, comment.
Re-balancing trees (Auslender, 1998, never finished): create samples with
respective 0/1 percentages equal to 45/55, 46/54, …, 54/46, 55/45. One
typically observes that the upper set of levels (split values and
variables) is similar or identical across all samples in tree
classification (next chapter).
The lower layer typically contains similar variables that are split,
sometimes in a different hierarchical order: variable 1 is split at level
4 in the 45/55 sample and at level 5 in the 50/50 sample, while variable
2 behaves reciprocally.
Conclusion: the top level is the core of the tree, and the middle level
still provides strong information. Below that, the information is not
reliable.
A similar approach is possible for logistic regression in the context of
co-linearity (co-linearity induces instability in the coefficients of
linear models). But see Owen next.
NB: balanced/unbalanced comparison examples in the next lecture.
Owen’s 2007 findings and recommendations for logistic.
Owen (2007) studies the infinitely imbalanced case, where N0 → ∞ with N1
fixed. For a "given" model (i.e., not a variable-search model), the
contribution of the X cases with Y = 1 depends entirely on the mean of X
given Y = 1, by the relation below.
Practical implication: for non-outlier data in X | Y = 1, the mixture of
0/1 does not particularly affect estimation, except for the intercept,
which → -∞. If X | Y = 1 is clustered, consider splitting the analysis
by clusters.
\bar{x} \;=\; \frac{\int e^{\beta' x}\, x\, dF_0(x)}{\int e^{\beta' x}\, dF_0(x)},
\qquad \bar{x}:\ \text{mean of } X \text{ given } Y = 1,
\qquad F_0:\ \text{distribution of } X \text{ given } Y = 0.
SEPARATION.
Separation is observed in the fitting process of a logistic regression
model when the likelihood converges to a finite value while at least one
parameter estimate diverges to (plus or minus) infinity.
In days of yore, separation occurred primarily in small or sparse samples
with highly predictive covariates, for a fixed model.
More interesting case:
groups.google.com/group/MedStats/browse
thread/thread3078fd372b83f662
Obs Y x1 x2 S
1 1 29 62 10
2 1 30 83 29
3 1 31 74 18
4 1 31 88 32
5 1 32 68 10
6 2 29 41 -11
7 2 30 44 -10
8 2 31 21 -35
9 2 32 50 -8
where S = x2 - 2*x1 + 6. Note the perfect separation of Y by S:
Y = 1 whenever S > -8. Also, S is obviously perfectly co-linear
with x1 and x2.
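The divergence can be demonstrated numerically. A Python sketch on a toy one-predictor data set (not the table above) that is perfectly separated at x = 0: as the slope grows, the log-likelihood keeps climbing toward its supremum of 0, so no finite maximum-likelihood estimate exists.

```python
import math

x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
y = [0, 0, 0, 1, 1, 1]              # perfectly separated at x = 0

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def loglik(beta):
    """Log-likelihood of a no-intercept logistic model with slope beta."""
    return sum(log_sigmoid(beta * xi) if yi == 1 else log_sigmoid(-beta * xi)
               for xi, yi in zip(x, y))

# Doubling the slope always improves the fit under separation:
lls = [loglik(b) for b in (1.0, 2.0, 4.0, 8.0)]
```

Every finite slope is beaten by a larger one, which is exactly the "likelihood converges, estimate diverges" behavior described above.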
Canonical Discriminant Analysis.
Canonical Discriminant Analysis (CDA).
(seen in Cluster Analysis lecture) “Canonical” is the statistical
term for analyzing latent variables (which are not directly
observed) that represent multiple variables (which are directly
observed). CDA is related to PCA and Canonical Correlation
Analysis (CCA).
Canonical Variate (CV) is weighted sum of the variables in the
analysis. CCA is useful in analyzing strength of association
between two constructs of continuous variables.
CDA finds linear functions of variables that maximally separate the
means of groups of observations into two or more groups (given by
a nominal target variable), while maintaining variation within
groups as small as possible. PCA summarizes total variation. #
groups = # nominal levels.
Canonical Discriminant Analysis (CDA).
CDA equivalent to canonical correlation analysis between
INTERVAL vars and set of dummy variables coded from
TARGET. It produces ‘k’ Canonical Discriminant Functions
(CDFs) or canonical variables, ‘k’ = min ( # groups – 1, #
variables).
CDF_1 yields max variation between groups w.r.t. within-
group variation  greatest degree of group differences.
CDF_2, uncorrelated to CDF_1, group differences not
captured by CDF_1.
Previously, the clustering examples showed 2 CDFs, since the number of
chosen clusters was 3.
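In standard notation (not spelled out on the slide), the CDFs solve a generalized eigenproblem: with B the between-group and W the within-group sum-of-squares-and-cross-products matrix, the first canonical weight vector maximizes the between-to-within variation ratio,

```latex
w_1 \;=\; \arg\max_{w}\; \frac{w' B\, w}{w' W\, w},
```

and is the leading eigenvector of \(W^{-1}B\); CDF_2 is the next eigenvector, uncorrelated with CDF_1, and so on, up to \(k = \min(\#\text{groups} - 1,\ \#\text{variables})\) functions.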
Un-reviewed Additional Classification Methods.
Linear Discriminant Analysis (especially for multi-categorical
dependent variable) but requires normality as strong assumption.
Also Quadratic DA, and Flexible DA.
Naïve Bayes (also useful for continuous dependent
variables).
Support Vector Machines
Neural Networks (also useful for continuous dep var)
Clustering: Reviewed in earlier presentation, can be used as
classification tool.
ID3 and C4, C4.5, PART (briefly mentioned in Tree presentation).
K-Nearest Neighbor
Mixture discriminant analysis
(and review in later chapters):
Classification and Regression Trees (CART)
Bagging CART
Random Forest
Gradient Boosting
Basic logistic SAS program (note that selection= is a MODEL-statement
option, so it belongs there rather than on SCORE):

proc logistic data = &data.;
   model &depvar. (event = "&event.") = &outvars. / selection = none
         outroc = troc;
   score data = &validation. out = &val_out. outroc = valoutroc;
   roc; roccontrast;
run;
SAS program:
For the ROC curve obtained from proc logistic, see the outroc= data set.
For the AUROC, place before the "proc logistic" call:

ods output Association = assoc_out;

and after the proc logistic run:

proc sql noprint;
   select nvalue2 into :auroc from assoc_out
      where upcase(label2) = "C";
quit;
%put auroc is &auroc.;
1) Explain, technically or non-technically, separation and
quasi-separation. Are there solutions, or is it not a problem? Does it
apply to other supervised methods? Is it co-linearity?
2) How do interactions in logistic regression differ from those in
linear regression? Do some research.
3) Are unbalanced samples a problem in logistic regression? Does it make
a difference whether you have a fixed or a searched model?
4) Accuracy vs. precision?
References
Hosmer, D., Lemeshow, S. (2000): Applied Logistic Regression, Wiley.
Owen, A. (2007): Infinitely Imbalanced Logistic Regression, Journal of
Machine Learning Research.
The End??
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 

Predicting Fraud with Logistic Regression

  • 5. 2019-06-20 Meaning of Probability Statements: context dependent.
Probability is measured on the interval (0, 1), and the methods are "mechanical", i.e., context is provided by the analyst. E.g.:
1. A model estimates that a household has a 70% probability of responding to a credit card solicitation. Solicitation cost is minimal, and the bad feeling of a non-responding customer who didn't want to be solicited is disregarded. Likely action: solicit.
2. A model estimates that the probability that the conference ceiling will fall on us right now is 40%. How many of you will stay until I finish reading this paragraph? Action: run for your life? Sue the presenter?
3. DNA matching asserts that Prob(male A is father of baby) = 95%, i.e., 1 in 20 is a false positive. Action: A is the father?
4. The probability of a devastating earthquake in NYC in the next 24 hours is negligible but still non-zero. Nobody seems to care about this element of probability; why? Would the same apply in LA, with its higher probability?
⇒ The cost (profit) of implementing/not implementing a decision, even if not exactly quantifiable, is most important in context. CONTEXT, CONTEXT, CONTEXT.
  • 8. 2019-06-20 Binary Dependent Variable – Some Definitions.
1) Standard linear model Y* = Xβ + ε, but now Y is not continuous:
Y = 1 if Y* > k, 0 otherwise (e.g., 'good' student if GPA > 3),
with the usual assumptions and k usually unknown. Ergo, estimate Y*? Assume k = 0 below for ease of exposition.
Let π = Pr(Y = 1 | X = x) = Pr(Xβ + ε > 0) = Pr(ε > -Xβ) = 1 – F(-Xβ), F the CDF of ε. Since E(ε) = 0 ⇒ E(Y*) = Xβ, expanded in terms of Y as:
E(Y) = Σ πi yi = π(1) + (1 – π)(0) = π ⇒ π = Xβ.
Problem: if Y is 0/1 ⇒ ε can take only two values, and therefore is not normal (do we care?).
  • 9. 2019-06-20 Binary Dependent Variable – Some Definitions to drop linear model.
If Yi = 1 ⇒ εi = 1 – E(Yi) = 1 – Xiβ = 1 – πi, which occurs with probability πi; if Yi = 0 ⇒ εi = –πi, which occurs with probability (1 – πi): the error is not normally but binomially distributed, and its variance π(1 – π) is not constant (if π = .5 ⇒ maximum variance = .25; if π = .01, in the tails, ⇒ variance = .0099; as π → 1, variance → 0) ⇒ any predictor affecting the mean also affects the variance ⇒ the usual linear model is not applicable because it assumes constant variance.
Obviously, estimated π in Y* above may lie outside [0, 1] ⇒ est. var. < 0. One could still rescale predicted values to lie within 0–1 (Foster & Stine, 2004), but how to compare two models that predict at different values > 1? (Rescaling to (0, 1) is quite ad hoc.)
More serious reason to drop the linear model: linear marginal effects. The linear model implies that the effect of ΔX on prob(Y) is constant regardless of the initial value of prob(Y). E.g., ΔX = 1 ⇒ Δprob(Y) = .1 whether prob(Y) = .5 or prob(Y) = .93. Intuitively, Δprob(Y) should be larger for prob(Y) closer to 0.5.
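The non-constant variance argument above can be checked directly: a 0/1 outcome with mean π has variance π(1 – π). A minimal Python sketch (not from the deck):

```python
# Sketch: in the linear probability model the error variance is
# pi * (1 - pi), so it changes with the mean -- unlike the constant
# sigma^2 assumed by ordinary least squares.
def bernoulli_variance(pi):
    """Variance of a 0/1 outcome with success probability pi."""
    return pi * (1 - pi)

print(bernoulli_variance(0.5))   # maximum variance, 0.25
print(bernoulli_variance(0.01))  # near the tail, ~0.0099
```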
  • 10. 2019-06-20 Contrived example to motivate marginal effects.
Suppose an urn with 50 red balls and 50 blue balls. The exercise is to determine the percentage of red balls in the urn, and the 'effort' required to increase the probability of obtaining a red ball. "Effort" (i.e., the X variable) is the number of additional red balls to add to increase the probability. Starting from the 50/50 mixture, X = 0 (i.e., not adding a red ball yet) just gives 50%. To reach 51%, 52%, etc., solve for r in the equation (b remains at 50, r = # reds, b = # blues):
prob = r / (r + b) ⇒ r = b * prob / (1 – prob).
Let's see how to raise prob from .5 to .95 in jumps of .05 by raising r:
Obs   b    prob    r
1     50   0.50    50.000
2     50   0.55    61.111
3     50   0.60    75.000
4     50   0.65    92.857
5     50   0.70    116.667
6     50   0.75    150.000
7     50   0.80    200.000
8     50   0.85    283.333
9     50   0.90    450.000
10    50   0.95    950.000
Δprob / ΔX is not constant, as it is in linear regression. It takes an increasing number of red balls to increase prob.
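The table above can be reproduced from the solved equation r = b · prob / (1 – prob); a short Python sketch (not from the deck):

```python
# Sketch of the urn arithmetic: with b = 50 blue balls fixed, the number of
# red balls needed for a target probability p solves
#   p = r / (r + b)  =>  r = b * p / (1 - p).
def reds_needed(p, b=50):
    return b * p / (1 - p)

for p in (0.50, 0.55, 0.75, 0.80, 0.90, 0.95):
    print(p, round(reds_needed(p), 3))
```

Note how each extra .05 of probability costs more red balls than the last, which is the whole point of the example.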
  • 11. 2019-06-20 Prob rises from .75 to .80 when the reds rise from 150 to 200, but going from .90 to .95 requires raising the red balls from 450 to 950.
  • 12. 2019-06-20 Binary Dependent Variable – Some Definitions.
For ordinal or nominal dependent variables, coding is arbitrary, and any transformation gives different results ⇒ transform π instead. Need a non-decreasing function of Xβ onto the unit interval. The function usually chosen is the CDF of the unit-normal distribution ⇒ PROBIT method, or the standardized logistic distribution ⇒ linear logistic or logit model (more convenient, no integrals, similar to a t-distribution with 7 dfs). Non-symmetric choices: log-log and complementary log-log (cloglog, used in survival and interval-censored models).
NB: the purpose is to model the probability of occurrence of the event.
  • 13. 2019-06-20 Different functions. In the next slide:
Logistic: P1 = 1 / (1 + exp(-X * 3))
Probit: P1 = Inv_norm(X * 3 * √3 / π)
Linear: P1 = .5 + X / 3
Complementary log-log (extreme value, Gompit, note its skewness): P1 = 1 - exp(-exp(X * 3))
Not shown: log-log: P1 = exp(-exp(-X * 3)) (seldom used because of inappropriate behavior for π < .5).
For which link is "best", see Koenker and Yoon (2009, Journal of Econometrics).
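The four links can be evaluated side by side; a Python sketch (not from the deck; the × 3 scaling follows the slide, and the probit here uses the plain unit-normal CDF without the slide's variance-matching factor):

```python
import math

# Sketch: four link functions mapping a linear score z to (0, 1).
def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    # Unit-normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cloglog(z):
    return 1.0 - math.exp(-math.exp(z))  # complementary log-log

def loglog(z):
    return math.exp(-math.exp(-z))       # log-log

for x in (-1.0, 0.0, 1.0):
    z = 3 * x
    print(x, logistic(z), probit(z), cloglog(z), loglog(z))
```

At z = 0 the symmetric links give 0.5, while cloglog gives 1 – e⁻¹ ≈ 0.632 and log-log gives e⁻¹ ≈ 0.368, showing their skewness.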
  • 14. 2019-06-20 Binary Dependent Variable – Comparison of approaches. Cloglog approaches 1 faster.
  • 15. 2019-06-20 Binary Dependent Variable, odds and log odds.
Logistic is easier than probit mathematically. Note that the logit (the 'linear equation', Xβ) estimates log-odds, not probability.
π = e^(Xβ) / (1 + e^(Xβ)) = [1 + e^(-Xβ)]^(-1), the logistic CDF.
Interpretable as log-odds: Odds = π / (1 – π) = e^(Xβ) ⇒ log(π / (1 – π)) = Xβ, the log-odds or logit.
Log-odds are popular in gambling: e.g., odds of 19 to 1 that the 'house' will win (equivalent to a 95% probability of winning for the house).
π is a nonlinear function ⇒ requires an iterative method of estimation.
NB: will skip individual parameter inference.
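The probability / odds / log-odds round trip above is easy to verify; a Python sketch (not from the deck):

```python
import math

# Sketch: probability <-> odds <-> log-odds conversions.
def odds(p):
    return p / (1 - p)

def logit(p):
    return math.log(odds(p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

p = 0.95
print(odds(p))              # ~19, the "19 to 1" house odds
print(inv_logit(logit(p)))  # back to ~0.95
```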
  • 16. 2019-06-20 [Plot: X = Unif * 2 * N(9); Y = 1 / (1 + exp(-1.5 * x)).]
  • 17. 2019-06-20 Example of gambling odds.
What is the meaning of 1 / p? Especially in the case of rare events, that is, when the probability of an event is low, the reciprocal of the probability, 1 / p, provides a 1-in-n rescale that can be informative. For instance, if prob(event) = 0.018, its reciprocal is about 1 in 56 tries. If the distribution is geometric, then when p = 0.018 you would need about 56 flips of a coin, on average, to obtain the desired event.
We can relate this topic to odds, which will be useful later on. If the odds of winning a bet are 1.25 and you bet 8, then you get 8 * 1.25 = 10 back when you win (including the original bet, for a total profit of 2), and nothing in the case of a loss. The bet would be fair if the probability of winning were 8 / 10 = 0.8, the reciprocal of which is 1.25.
In the context of survey sampling, the reciprocal of the probability of being included is called the sampling weight.
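The fair-bet arithmetic above can be written out directly; a small Python sketch (not from the deck):

```python
# Sketch: reciprocal of a probability as a "1 in n" rescale, and the
# fair-bet check from the text: decimal odds are fair when they equal 1 / p.
p_win = 0.8
decimal_odds = 1 / p_win       # 1.25
stake = 8
payout = stake * decimal_odds  # 10.0, including the original stake
profit = payout - stake        # 2.0 on a win
print(decimal_odds, payout, profit)
```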
  • 20. 2019-06-20 Fraud Data Notice
Basic information on the original data sets:
Data set name .............. train
Num observations ........... 3595
Validation data set ........ validata
Num observations ........... 2365
Test data set .............. (none)
Num observations ........... 0
Dep variable ............... fraud
Pct Event Prior TRN ........ 20.389
Pct Event Prior VAL ........ 19.281
Pct Event Prior TEST ....... (none)
  • 21. 2019-06-20 Typical output for logistic regression, similar to linear.
Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC          2320.999          1968.388
SC           2326.767          2008.768
-2 Log L     2318.999          1954.388
R-Square 0.1429    Max-rescaled R-Square 0.2286
Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    364.6107      6     <.0001
Score               400.9366      6     <.0001
Wald                268.8171      6     <.0001
  • 22. 2019-06-20
Analysis of Maximum Likelihood Estimates
Parameter          DF   Estimate   Std Error   Wald Chi-Square   Pr > ChiSq
INTERCEPT          1    -0.3390    0.2145      2.4980            0.1140
DOCTOR_VISITS      1    -0.00609   0.00820     0.5517            0.4576
MEMBER_DURATION    1    -0.00657   0.000807    66.4090           <.0001
OPTOM_PRESC        1    0.1429     0.0307      21.5918           <.0001
TOTAL_SPEND        1    -0.00002   6.015E-6    8.1934            0.0042
NO_CLAIMS          1    0.7634     0.0550      192.3966          <.0001
NUM_MEMBERS        1    -0.1089    0.0596      3.3309            0.0680
Odds Ratio Estimates
Effect             Point Estimate   95% Wald Confidence Limits
DOCTOR_VISITS      0.994            0.978   1.010
MEMBER_DURATION    0.993            0.992   0.995
OPTOM_PRESC        1.154            1.086   1.225
TOTAL_SPEND        1.000            1.000   1.000
NO_CLAIMS          2.146            1.926   2.390
NUM_MEMBERS        0.897            0.798   1.008
  • 24. 2019-06-20 Coefficient interpretation: Odds, Odds-ratios (non-interactive model).
est. log(odds) = a + b X1 + c X2 ⇒ estimated Odds = e^(a + b X1 + c X2), X2 at any value.
Odds(Y | X1 = x1) = e^(a + b x1 + c X2)
Odds(Y | X1 = x1 + 1) = e^(a + b (x1 + 1) + c X2)
Odds Ratio = e^(a + b (x1 + 1) + c X2) / e^(a + b x1 + c X2) = e^b.
When X1 = 0, log(odds Y) = a + c X2.
If b = .04 ⇒ e^b = 1.0408 ⇒ Odds(Y) for X1 + 1 is 4.08% higher than Odds(Y) for X1; in general, the odds change by (e^b – 1) * 100% per unit increase in X1.
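The e^b claim above can be checked numerically; a Python sketch with arbitrary, illustrative coefficient values (not from the deck):

```python
import math

# Numeric check: with b = 0.04, a one-unit increase in X1 multiplies the
# odds by e^b (~4.08% increase), regardless of a, c, x1, or X2.
a, b, c = -1.0, 0.04, 0.3   # illustrative coefficients
x1, x2 = 2.0, 5.0           # illustrative covariate values

odds_at      = math.exp(a + b * x1 + c * x2)
odds_at_plus = math.exp(a + b * (x1 + 1) + c * x2)
ratio = odds_at_plus / odds_at
print(ratio, math.exp(b))  # both ~1.0408
```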
  • 25. 2019-06-20 Coeff. interpretation – dummy var – non-interaction model.
Y ∈ {0,1}, X ∈ {0,1} .... logit(Y) = α + βX + ε
Pr(Y=1 | X=1) = e^(α+β) / (1 + e^(α+β)) .... finding odds:
Pr(Y=1 | X=1) / Pr(Y=0 | X=1) = Pr(Y=1 | X=1) / (1 – Pr(Y=1 | X=1)) = e^(α+β) .... odds for X=1.
Pr(Y=1 | X=0) / Pr(Y=0 | X=0) = e^α .... odds for X=0.
Odds Ratio: ratio of odds = e^(α+β) / e^α = e^β.
More generally: if X ∈ {r, s}, odds for X=r: e^(α+rβ) (similarly for X = s); the odds ratio of an increase of X from r to s is e^(β(s–r)).
β is the change in log odds due to a one-unit change of X.
  • 26. 2019-06-20 Coefficient interpretation: Odds-ratios (interactive model).
est. log(odds) = a + b X1 + c X2 + d X1 X2 ⇒ Odds = e^(a + b X1 + c X2 + d X1 X2).
Odds(Y | X1 = x1) = e^(a + b x1 + c X2 + d x1 X2)
Odds(Y | X1 = x1 + 1) = e^(a + b (x1+1) + c X2 + d (x1+1) X2)
Odds Ratio (X1) = e^(a + b (x1+1) + c X2 + d (x1+1) X2) / e^(a + b x1 + c X2 + d x1 X2) = e^(b + d X2)
⇒ e^b is NOT the odds ratio for X1, except when X2 = 0; the odds ratio is conditional on X2 values.
NB: if X2 is measured in deviation terms, say from its mean ⇒ e^b is the odds-ratio for X2 = mean.
  • 27. 2019-06-20 Coefficient interpretation: Odds-ratios (interactive model).
est. log(odds) = a + b X1 + c X2 + d X1 X2.
Odds Ratio (X1) = e^(a + b (x1+1) + c X2 + d (x1+1) X2) / e^(a + b x1 + c X2 + d x1 X2) = e^(b + d X2)
Odds Ratio (X2) = e^(c + d X1)
Odds Ratio (X1, X2), raising both by one unit: e^(b + c + d (X1 + X2 + 1))
⇒ OR(X1, X2) = OR(X1) * OR(X2) * e^d.
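The interaction identity OR(X1, X2) = OR(X1) · OR(X2) · e^d can be verified numerically; a Python sketch with arbitrary, illustrative coefficients (not from the deck):

```python
import math

# Numeric check of the interactive-model odds-ratio identity.
a, b, c, d = -0.5, 0.2, 0.7, 0.1  # illustrative coefficients
x1, x2 = 1.0, 3.0                 # illustrative covariate values

def odds(u1, u2):
    return math.exp(a + b * u1 + c * u2 + d * u1 * u2)

or_x1   = odds(x1 + 1, x2) / odds(x1, x2)      # = e^(b + d*x2)
or_x2   = odds(x1, x2 + 1) / odds(x1, x2)      # = e^(c + d*x1)
or_both = odds(x1 + 1, x2 + 1) / odds(x1, x2)  # raise both by 1
print(or_both, or_x1 * or_x2 * math.exp(d))    # equal
```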
  • 29. 2019-06-20 Logistic regression output for coefficient evaluation: Analysis of Maximum Likelihood Estimates and Odds Ratio Estimates (repeated from slide 22).
  • 30. 2019-06-20 Interpretation.
Intercept: exp(-0.3390) ≈ 0.71 is the baseline odds, the ratio of the probability of fraud versus non-fraud when all predictors are zero, i.e., no doctor visits, no prescriptions, zero member duration, etc.
Member Duration: 0.993 (odds ratio point estimate) is less than 1, which means that the ratio of the probability of fraud versus non-fraud decreases the longer the member duration. The odds ratio is constant as long as there is no interaction term.
Number of claims: the odds ratio of 2.146 means that the ratio of the probability of fraud versus non-fraud increases with the number of claims, as long as there is no interaction term. Etc.
  • 35. 2019-06-20 Binary Dependent Variable – Log-likelihood.
Since observations are assumed independent, the joint probability is (take logs because it is easier to work with logs):
Pr(Y1, ..., Yn) = Π π^Y (1 – π)^(1 – Y).
Taking logs (omitting the "i" subscript), the log-likelihood function is:
ln L(π; data) = Σ [ Y ln π + (1 – Y) ln(1 – π) ] = Σ [ Y ln(π / (1 – π)) + ln(1 – π) ].
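The Bernoulli log-likelihood above can be computed directly; a Python sketch for a constant π (not from the deck; with covariates, each πi would come from the logistic link):

```python
import math

# Sketch of the Bernoulli log-likelihood for a constant pi.
def log_likelihood(pi, y):
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi in y)

y = [1, 0, 0, 1, 1]
# For a constant pi the MLE is the sample mean (0.6 here);
# nearby values give a lower log-likelihood.
print(log_likelihood(0.6, y), log_likelihood(0.5, y), log_likelihood(0.7, y))
```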
  • 36. 2019-06-20 Binary Dependent Variable – Model evaluation via Inference.
Nested approach: does the model with added predictor/s provide "significantly" more information about the dependent variable than the model without them? Typically, H0: constrained model, H1: fuller model (NB: no notion of model fit as in the R2 of regression).
Compare two nested models:
Log(odds) = α + β1x1 + β2x2 + β3x3 + β4x4 (model 1) H1
Log(odds) = α + β1x1 + β2x2 (model 2) H0
Three tests:
– Likelihood ratio test (LRT): balanced between H0 and H1.
– Wald test: starts at H1 and looks back towards H0.
– Score or Lagrange Multiplier (LM) test: starts at H0 and asks whether movement towards H1 is an improvement. The 1st derivative of the likelihood function is called the score function.
  • 37. 2019-06-20 Model χ2 and Model Comparisons.
Likelihood-ratio test: used to contrast two models, one of which is a subset of the other (i.e., some βj's set to zero; called nested-model inference).
G0 = 2 (log L1 – log L0) ~ χ2(k), k = # coefficients set to 0 (called the deviance, and sometimes residual deviance). G0 is called -2 LLR or 2 LLR (depending on the order of the likelihoods). A bit of confusion: -2 LLR is also called deviance, but specifically for the model vs. the saturated model, and the saturated model has LLR = 0.
Is likelihood the same as probability? In the case of probability, we know the pdf (probability distribution function) and the parameter values and want to know the probability of specific event/s; i.e., we know the betas. In the case of likelihood, given specific event/s, we estimate the pdf and/or the model parameters. That is why, given the data, we want to estimate the parameters that maximize the probability of the event/s reflected by the data, for an assumed pdf. We don't know but estimate the betas.
  • 38. 2019-06-20 Score (LM) Test.
Let U(β) be the vector d LLR / dβ: the gradient vector.
Let H(β) be the matrix d2 LLR / dβ2: the Hessian matrix.
Let I(β) = -H(β), or the expected value of -H(β).
Let β0 be the MLE estimates under H0.
⇒ U′(β0) I(β0)^(-1) U(β0) ~ χ2(r).
  • 39. 2019-06-20 Example, Y ~ B(n, θ).
H0: θ = 0.5, H1: θ = 0.2. Wald is based on the distance (0.5 – 0.2), LRT on the vertical (log-likelihood) distance, LM on the slope of the log-likelihood at θ = 0.2. When the log-likelihood is a smooth curve, all tests yield the same answer.
[Figure: log-likelihood vs. θ over (0, 1); the score test is the slope at θ = 0.2; half the deviance and the Wald statistic are marked.]
  • 40. 2019-06-20 Non-nested Model Comparisons:
Akaike Information Criterion (AIC) = Deviance (= -2 LLR) + 2k, k = # model parameters.
BIC (Bayesian Information Criterion), or SC = Deviance + k log n.
In both cases, smaller ⇒ better model.
NOTE: in non-parametric models, such as tree-based models, 'k' is not fully defined ⇒ no fully accepted AIC or BIC.
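These formulas tie back to the fit statistics shown earlier (-2 Log L = 1954.388, k = 7 parameters: 6 covariates plus the intercept). A Python sketch (not from the deck; n = 2365 is an assumption, chosen because it reproduces the reported SC):

```python
import math

# Sketch: AIC and BIC/SC from a model's deviance (-2 log-likelihood).
def aic(deviance, k):
    return deviance + 2 * k

def bic(deviance, k, n):
    return deviance + k * math.log(n)

print(round(aic(1954.388, 7), 3))        # ~1968.388, the reported AIC
print(round(bic(1954.388, 7, 2365), 3))  # ~2008.768, the reported SC
```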
  • 41. 2019-06-20 Somers' D and Pairs Concordance.
P = # possible pairs of observations with different values of Y = n0 * n1.
% Concordant: % of pairs in which the predicted probability for the "1" observation exceeds that for the "0" observation, nc / P.
% Discordant: % of pairs in which the predicted probability for the "1" observation is below that for the "0" observation, nd / P.
% Tied: % of pairs neither concordant nor discordant, nt / P.
(all pairs discordant) -1 ≤ Somers' D ≤ 1 (all pairs concordant); Somers' D = (nc – nd) / (nc + nd + nt).
Gamma = (nc – nd) / (nc + nd). Tau-a = (nc – nd) / P.
c = .5 (1 + Somers' D), more popularly called the ROC area (reviewed below).
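The pairwise definitions above can be computed by brute force over all (event, non-event) pairs; a Python sketch (not from the deck; fine for illustration, O(n0·n1) in general):

```python
# Brute-force sketch of the pairwise concordance measures: compare every
# (event, non-event) pair of predicted probabilities.
def concordance(p_events, p_nonevents):
    nc = nd = nt = 0
    for pe in p_events:
        for pn in p_nonevents:
            if pe > pn:
                nc += 1
            elif pe < pn:
                nd += 1
            else:
                nt += 1
    pairs = nc + nd + nt
    somers_d = (nc - nd) / pairs
    c = 0.5 * (1 + somers_d)  # the "c" statistic
    return nc, nd, nt, somers_d, c

print(concordance([0.9, 0.7, 0.4], [0.3, 0.4, 0.1]))
```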
  • 42. 2019-06-20 Hosmer-Lemeshow fit (2000, p. 148).
Posterior probabilities in logistic regression are calculated from covariate patterns. In data mining, most patterns have very few observations, and many potential patterns are empty. Assume 5 binary predictors in the model ⇒ maximum number of patterns = 2^5 = 32. Assume only J = 8 patterns exist in the data ⇒ expected values will be small or 0 in most cells. In data mining, typically n = J. With continuous predictors, the number of potential patterns is much larger.
  • 43. 2019-06-20 Hosmer-Lemeshow fit, "decile of risk" (2000, p. 148).
Proposal: "create patterns" by grouping by percentiles of the posterior predicted distribution. By simulation, if g = 10 percentiles (deciles) are chosen, the statistic is distributed as chi-square(g – 2) when J = n. Use Pearson's χ2 test; small p-values indicate lack of fit.
Problem: the Hosmer-Lemeshow test is known to fail with continuous covariates due to the large number of potential patterns. Besides, it is known to produce misleading results due to ties or the order of observations (Bertolini et al., 2000). The test is considered obsolete (Hosmer et al., 1997). DO NOT USE IT.
  • 44. 2019-06-20 What would statisticians do without R2? Create more ...
Cox and Snell (1989): R2_CS = 1 – [L(0) / L(β̂)]^(2/n).
Nagelkerke (1991): R2_N = R2_CS / Max R2_CS, where Max R2_CS = 1 – [L(0)]^(2/n).
(Obtained via the RSQUARE option in SAS Proc Logistic, called RSquare and Max-rescaled RSquare, respectively.)
Brier score: (1/n) Σ (π̂i – Yi)^2.
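A Python sketch of these formulas in log-likelihood form (not from the deck). The inputs are the -2 Log L values reported earlier (2318.999 null, 1954.388 full), halved and sign-flipped; n = 2365 is an assumption, chosen because it reproduces the reported R-Square and Max-rescaled values:

```python
import math

# Sketch: Cox & Snell, Nagelkerke, and Brier score.
def cox_snell(ll0, ll1, n):
    # ll0, ll1 are null-model and fitted-model log-likelihoods.
    return 1 - math.exp((2.0 / n) * (ll0 - ll1))

def nagelkerke(ll0, ll1, n):
    return cox_snell(ll0, ll1, n) / (1 - math.exp((2.0 / n) * ll0))

def brier(p_hat, y):
    return sum((p - yi) ** 2 for p, yi in zip(p_hat, y)) / len(y)

ll0, ll1, n = -2318.999 / 2, -1954.388 / 2, 2365
print(round(cox_snell(ll0, ll1, n), 4))   # ~0.1429
print(round(nagelkerke(ll0, ll1, n), 4))  # ~0.2286
print(brier([0.9, 0.2, 0.6], [1, 0, 1]))  # ~0.07
```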
  • 45. 2019-06-20 [Annotated logistic regression output; markers (1)–(6) are explained on the next slide.]
  • 46. 2019-06-20
(1) Significant LRT: at least one beta significantly different from 0. Analogous to the F-test in linear regression. Agreement with (2) and (3); if no agreement, the model is in doubt.
(2) (4) and (5) confirm that the intercept-only model is inferior.
(3) In (6), the difference between the intercept-only model and the intercept-and-covariates model yields (1).
  • 47. 2019-06-20
Association of Predicted Probabilities and Observed Responses
Percent Concordant   75.4      Somers' D   0.512
Percent Discordant   24.2      Gamma       0.515
Percent Tied          0.4      Tau-a       0.160
Pairs              870504      c           0.756
Partition for the Hosmer and Lemeshow Test
               FRAUD = 1             FRAUD = 0
Group   Total  Observed   Expected   Observed   Expected
1       237    13         10.71      224        226.29
2       237    18         17.08      219        219.92
3       237    15         21.88      222        215.12
4       237    25         26.73      212        210.27
5       237    34         31.53      203        205.47
6       237    36         37.13      201        199.87
7       237    39         43.19      198        193.81
8       237    53         52.97      184        184.03
9       237    84         71.69      153        165.31
10      232    139        143.09     93         88.91
  • 48. 2019-06-20 [Figure only.]
  • 49. 2019-06-20 Concordance Measures. Example: Logit and Probit almost identical.
Measure              Cloglog     Logit       Probit
Gamma                0.545       0.565       0.565
Pairs                138659.0    138659.0    138659.0
Percent Concordant   71.400      77.128      77.117
Percent Discordant   21.033      21.430      21.415
Percent Tied         7.567       1.442       1.468
Somers' D            0.504       0.557       0.557
Tau-a                0.245       0.271       0.271
c                    0.752       0.778       0.779
  • 51. 2019-06-20 Misclassification Rates (typically assumes 0.5 cutoff) – Confusion Table.
0: "Negative" (non-event); 1: "Positive" (event).
                       Predicted 0   Predicted 1
Actual 0 (non-event)   TN            FP
Actual 1 (event)       FN            TP
TN: True Negative; FN: False Negative; FP: False Positive; TP: True Positive.
"True" / "False" refer to whether the prediction matches the actual state of nature; "Negative" / "Positive" refer to the prediction.
  • 52. 2019-06-20 Misclassification Rates, K-S, ROC (cont. 1).
1) Classification accuracy, or overall classification rate (1 – classification error rate, the converse of the misclassification rate), and "0" / "1" classification rates:
Overall: (TP + TN) / (TP + TN + FP + FN). Assumes a known and unchanging "natural" class distribution, and that the error cost of an FP equals that of an FN. Typically favors the majority class; but in most applications, the cost of misclassifying a "1" is higher.
"0" classification rate: TN / (TN + FP). "1" classification rate: TP / (TP + FN).
Misleading results: assume "1" is important. Overall (left model) = 92.5%, Overall (right model) = 97.5%, but the right model misses all "1"s.
Left model:                        Right model:
           Pred 0   Pred 1         Pred 0   Pred 1
Actual 0   180      15             195      0
Actual 1   0        5              5        0
  • 53. 2019-06-20 Misclassification Rates, K-S, ROC.
2) Event ("1"), non-event ("0") and overall precisions:
PR(1) = TP / (TP + FP) .... event precision (of those predicted as "1", the proportion of "true" ones).
PR(0) = TN / (TN + FN) .... non-event precision.
PR = (PR(1) + PR(0)) / 2 (theoretical max for PR(i) = 1).
3) Area under the Receiver Operating Characteristic curve (AUROC): graph of FP rate (x-axis) vs. TP rate (y-axis), varying the classification threshold. The Laplace estimate is used (in tree algorithms) because it gives more consistent improvements in ROC curves.
TP rate = TP / (TP + FN) (sensitivity)
FP rate = FP / (FP + TN) (1 – specificity)
Specificity = TN / (FP + TN).
  • 54. 2019-06-20 Recall, F1 ...
Recall / Sensitivity / TPR: TP / (TP + FN). Percentage of events well classified.
Specificity / TNR: TN / (TN + FP).
F1-score / F-measure / F-score: harmonic mean of precision and recall:
F1 = 2 * (Recall * Precision) / (Recall + Precision).
Requires that R and P not move opposite to each other. Useful when the costs of misclassifying 0 and 1 are very different. If costs are similar and counts of FN ≈ FP, accuracy can be used. Higher F1 ⇒ better. Also, a weighted F1-score can be used to give more importance to P or to R.
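A sketch tying the last few slides together: the standard metrics from raw confusion-table counts, checked on the left-hand model of slide 52 (TP = 5, TN = 180, FP = 15, FN = 0). Python, not from the deck:

```python
# Sketch: confusion-table metrics from raw counts.
def metrics(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)   # sensitivity / TPR
    specificity = tn / (tn + fp)   # TNR
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Left model of slide 52: accuracy 92.5% despite precision of only 25%.
print(metrics(tp=5, tn=180, fp=15, fn=0))
```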
  • 55. 2019-06-20 Which measure to use?
If costs are similar, counts of FN and FP are similar, and Recall does not move opposite to Precision, use F1; also when the event prior is much smaller.
Choose Recall if FN is more important than FP, e.g., misdiagnosing a cancer patient as healthy.
Choose Precision when you want predicted positives to be correct, e.g., spam: prefer to miss a true spam message rather than send a good message to the spam folder.
Choose Specificity if you want to concentrate on non-events and avoid false alarms (FP), e.g., a trial verdict.
  • 57. 2019-06-20 Validation: classification and precision rates ('-' ⇒ misclassification / misprecision), probability cutoff = 50%.
                                     Predicted 0        Predicted 1        Overall
Fraud    Model                       Class     Prec     Class     Prec     Class Rate
0        M1_TRN_LOGISTIC_NONE       76.87     88.53    -23.13    -59.64   76.87
0        M1_VAL_LOGISTIC_NONE       78.31     88.62    -21.69    -61.06   78.31
1        M1_TRN_LOGISTIC_NONE       -38.88    -11.47   61.12     40.36    61.12
1        M1_VAL_LOGISTIC_NONE       -42.11    -11.38   57.89     38.94    57.89
Overall  M1_TRN_LOGISTIC_NONE                 88.53              40.36    73.66
Overall  M1_VAL_LOGISTIC_NONE                 88.62              38.94    74.38
The model classifies correctly 76.87% of non-frauds as non-frauds and misclassifies 23.13%, with a probability cutoff point of 50%.
  • 61. 2019-06-20 Misclassification Rates, K-S, ROC.
Heuristically, AUROC = the proportion of pairs of observations, one '0' and one '1', such that the posterior probability for the '1' observation exceeds that for the '0' observation. Max value is 1.
Classification methods typically maximize classification rates ⇒ a tendency to focus just on these rates, which does not necessarily provide a good balance between events and non-events.
ROC curves can cross, e.g., for data sets with different proportions of 0/1. NW direction ⇒ better model. B is preferred to C for low FP rates, but C is preferred over B later on. A is clearly inferior.
"Best" model: max AUROC. But max AUROC may not be 'best' for a specific cost and class distribution. Plus, it is based on classification and not precision measures (detailed below).
  • 62. 2019-06-20 Misclassification Rates, K-S, ROC.
ROC: 1) a step function built by varying the classification threshold; different algorithms exist to 'smooth' the curves. 2) Invariant to any monotonic transformation of the posterior probabilities, because only the rankings matter.
  • 63. 2019-06-20 AUROC comparison: Gradient Boosting better than Logistic for TRN and VAL.
  • 64. 2019-06-20 [Figure only.]
  • 65. 2019-06-20 [Figure only.]
  • 66. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 66 Precision (P), Recall (R), and P and R vs. cutoff.
  • 67. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 67
  • 68. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 68 Ideally, the curve of the better model reaches toward the NE corner, and Gradient Boosting arches NE more than Logistic; this is equivalent to building the PR-AUC and comparing areas (not shown). GB better than LG. Only TRN shown.
  • 69. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 69 Standard recommendation For balanced data (i.e., prior about 50% and equal costs), choose the most outward ROC curve as the model of choice. Otherwise, use the Precision-Recall curve. Notice that non-events mostly affect the ROC curve. NOTE: we included Gradient Boosting (reviewed in a later chapter) to emphasize model comparisons.
  • 70. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 70
  • 71. 2019-06-20 Explanation via Example: Experian – White Paper. Lift or gains chart for credit promotion. The table can be created by deciles (10% intervals) or vingtiles (5% intervals) of the posterior prob. of being "good", in descending order of prob. More often binned by equal-sized bins of observations in descending order of probability. Max potential lift is 1 / event prior percentage. When comparing different models by deciles, note that the underlying probabilities can be different across models (e.g., logistic vs. Gradient Boosting). Cum Good % (Bads) measures the cumulative % of all Goods (Bads) captured up to the corresponding vingtile. % Cum Diff is the corresponding difference between goods and bads, and its largest value is the value of the K-S test. Rule of thumb: a good model should capture at least 70% of goods by the 5th decile / 10th vingtile. Lift: ratio of the % of events in a decile/vingtile to the overall event rate. Capture rate: # responders in the decile/vingtile / total # responders. Cum lift: corresponding cumulative. K-S (Kolmogorov-Smirnov): prob. point at which [Cum % captured events - non-events] is highest. Frequently used in finance. 71
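The binning-and-lift computation can be sketched in Python (hypothetical probs/labels lists; SAS is the lecture's actual tool): sort by descending posterior, cut into equal-count bins, and divide each bin's event rate by the overall rate.

```python
def lift_by_bin(probs, labels, n_bins=10):
    # Sort by descending posterior probability, cut into equal-count
    # bins, and report lift = (event rate in bin) / (overall event rate).
    pairs = sorted(zip(probs, labels), key=lambda t: -t[0])
    overall = sum(labels) / len(labels)
    size = len(pairs) // n_bins
    lifts = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        rate = sum(y for _, y in chunk) / len(chunk)
        lifts.append(rate / overall)
    return lifts
```

With a 40% event prior, the max potential lift is 1/0.4 = 2.5, and no bin's lift can exceed it.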
  • 72. 2019-06-20 Example: Experian – White Paper – Gains Table for goods vs bads (data set not available). 72
  • 74. 2019-06-20 Gains Table. Models: M1_TRN_LOGISTIC_NONE (TRN) and M1_VAL_LOGISTIC_NONE (VAL).

| Pctl | Model | Min Prob | Max Prob | % Event | Cum % Event | % Capt. Events | Cum % Capt. Events | Lift | Cum Lift |
| 10 | TRN | 0.424 | 1.000 | 63.61 | 63.61 | 31.24 | 31.24 | 3.12 | 3.12 |
| 10 | VAL | 0.384 | 0.997 | 59.92 | 59.92 | 31.14 | 31.14 | 3.11 | 3.11 |
| 20 | TRN | 0.263 | 0.423 | 35.38 | 49.51 | 17.33 | 48.57 | 1.74 | 2.43 |
| 20 | VAL | 0.250 | 0.382 | 35.59 | 47.78 | 18.42 | 49.56 | 1.85 | 2.48 |
| 30 | TRN | 0.207 | 0.263 | 23.06 | 40.69 | 11.32 | 59.89 | 1.13 | 2.00 |
| 30 | VAL | 0.199 | 0.250 | 21.94 | 39.15 | 11.40 | 60.96 | 1.14 | 2.03 |
| 40 | TRN | 0.177 | 0.207 | 18.38 | 35.12 | 9.00 | 68.89 | 0.90 | 1.72 |
| 40 | VAL | 0.169 | 0.199 | 15.68 | 33.30 | 8.11 | 69.08 | 0.81 | 1.73 |
| 50 | TRN | 0.152 | 0.177 | 14.17 | 30.92 | 6.96 | 75.85 | 0.69 | 1.52 |
| 50 | VAL | 0.143 | 0.168 | 15.19 | 29.67 | 7.89 | 76.97 | 0.79 | 1.54 |
| 60 | TRN | 0.130 | 0.152 | 15.60 | 28.37 | 7.64 | 83.49 | 0.77 | 1.39 |
| 60 | VAL | 0.123 | 0.143 | 15.25 | 27.27 | 7.89 | 84.87 | 0.79 | 1.41 |
| 70 | TRN | 0.107 | 0.130 | 13.33 | 26.22 | 6.55 | 90.04 | 0.65 | 1.29 |
| 70 | VAL | 0.103 | 0.123 | 9.70 | 24.76 | 5.04 | 89.91 | 0.50 | 1.28 |
| 80 | TRN | 0.085 | 0.107 | 10.03 | 24.20 | 4.91 | 94.95 | 0.49 | 1.19 |
| 80 | VAL | 0.082 | 0.103 | 6.36 | 22.46 | 3.29 | 93.20 | 0.33 | 1.17 |
| 90 | TRN | 0.063 | 0.085 | 6.39 | 22.22 | 3.14 | 98.09 | 0.31 | 1.09 |
| 90 | VAL | 0.061 | 0.081 | 8.02 | 20.85 | 4.17 | 97.37 | 0.42 | 1.08 |
| 100 | TRN | | 0.063 | 3.90 | 20.39 | 1.91 | 100.00 | 0.19 | 1.00 |
| 100 | VAL | 0.004 | 0.061 | 5.08 | 19.28 | 2.63 | 100.00 | 0.26 | 1.00 |
  • 75. 2019-06-20 TRN and VAL results very similar, no overfitting.
  • 76. 2019-06-20 Selecting Cutoff point (remember 'k' of first pages?) Many criteria, but BRAINS are most important. KS (Kolmogorov-Smirnov): probability point at which events and non-events are most separated cumulatively. Cumulative Lift: probability point for a selected cum lift value (arbitrary). Profit/Cost Business decision: prob point at which profit is max / cost is min. Many others. BUT: 1) Consider the costs of the decision: mail piece to the wrong person vs. wrong HIV treatment? 2) Are the categories under study discrete, or were they derived from a continuous scale, such as defining default as a payment 3 months late? If so, shouldn't the decision take into consideration how late the payment was? 3) If probabilities are too close to the decision point, shouldn't we get more data to decide?
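The K-S cutoff criterion can be sketched as follows: scan cutoffs from the highest score down, track the cumulative capture rates of events and non-events, and keep the score where their gap is largest (Python sketch on illustrative data; the slides' tooling is SAS):

```python
def ks_statistic(scores, labels):
    # K-S: max gap between cumulative capture rates of events ('1') and
    # non-events ('0'), scanning cutoffs from the highest score down.
    n1 = sum(labels)
    n0 = len(labels) - n1
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    c1 = c0 = 0
    best_gap, best_cut = 0.0, None
    for s, y in pairs:
        if y == 1:
            c1 += 1
        else:
            c0 += 1
        gap = abs(c1 / n1 - c0 / n0)
        if gap > best_gap:
            best_gap, best_cut = gap, s
    return best_gap, best_cut
```

On perfectly separated toy scores the gap reaches 1.0 at the lowest event score, which is the cutoff this criterion would pick.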
  • 78. 2019-06-20 Lines at 0.3 and 0.207 are different cutoffs. Note that the proportion of true events is higher than the probability indicated by the logistic model at about point 0.3. ***
  • 79. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 79
  • 80. 2019-06-20 Forward selection for logistic regression (Hosmer and Lemeshow, 2000) Assume 'p' available predictors. At step 0 (no variables yet entered), fit the "intercept-only model" and denote its log-likelihood by L0. Fit 'p' univariate logistic regressions, one per predictor, and obtain 'p' log-likelihoods, L1, …, Lp. Calculate 'p' likelihood ratio tests, Gj = -2 (L0 – Lj), j = 1…p, and obtain corresponding p-values pj, such that Prob[χ2(ν) ≥ Gj] = pj, where ν = 1 if Xj is continuous, and ν = k – 1 if Xj has k categories. Choose the predictor Xj with the minimum p-value that is below the entry 'alpha' level, which usually ranges between .15 and .2. Call this predictor Xj0. In the next step, the logistic regression with Xj0 takes the previous role of the intercept-only regression, and a new search is started with the p - 1 remaining candidate predictors. The search is stopped when no p-value is less than the entry alpha level. While α = 5% is embedded in many practitioners' activities, it is known that in variable selection that level is too stringent. It is usually recommended that the "α" level for entry be in the range from 15% to 20%, or even higher depending on the breadth or exploratory nature of the study. Notice that, at least in Hosmer and Lemeshow (2000), remaining predictors are not orthogonalized relative to the one just selected (as they are in variable selection for linear regression, where orthogonalization is obtained by partialing).
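One forward-selection step can be sketched numerically. For ν = 1, the chi-square survival function reduces to erfc(√(G/2)), so no statistics library is needed; the log-likelihood values below are hypothetical and only illustrate the mechanics:

```python
import math

def lrt_pvalue_df1(l0, lj):
    # G_j = -2 * (L0 - Lj); for 1 degree of freedom the chi-square
    # survival probability is erfc(sqrt(G/2)).
    g = max(-2.0 * (l0 - lj), 0.0)
    return math.erfc(math.sqrt(g / 2.0))

def forward_step(l0, candidate_logliks, alpha=0.15):
    # candidate_logliks: {predictor name: log-likelihood of its
    # univariate logistic regression}. Returns the predictor with the
    # smallest p-value if it clears the entry alpha, else (None, None).
    pvals = {name: lrt_pvalue_df1(l0, lj)
             for name, lj in candidate_logliks.items()}
    best = min(pvals, key=pvals.get)
    if pvals[best] < alpha:
        return best, pvals[best]
    return None, None
```

A predictor that barely moves the log-likelihood (e.g., L0 = -100 vs. Lj = -99.99) gives G = 0.02 and a p-value near 0.89, so it would not enter even at generous alpha levels.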
  • 81. 2019-06-20 Stepwise Selection for logistic regression. 1. We proceed in a fashion similar to forward selection in steps 1 and 2, and assume that we have selected X1 and X2 so far. At this moment, we perform a backward selection step to check whether we should retain the variables selected earlier. We accomplish this by creating a model with just X1, obtaining its log-likelihood and by LRT comparing it to the full model log-likelihood. 2. In the general case of k ‘entered’ variables, the candidate to remove is that which yields the highest LRT p-value when removed. The comparison is done against the α-to-remove level that marks the threshold level of explanatory power. If the highest p-value is higher than the α-to-remove, the variable is removed; otherwise, it stays in the model. 3. The α-to-remove level must be higher than the α-to-enter to prevent cycles of the same variable entering and leaving the model. Higher α levels, both to enter and to be removed, allow for more variables to remain in the model. 4. The search continues adding and removing variables until a) either all ‘p’ candidate variables have been entered; or b) all the variables in the models have p-values to remove that are less than the α-to-remove level, and those variables not selected have p-values higher than the α-to-enter level.
  • 83. 2019-06-20 Logistic Selection Steps – summary of entry/removal by 3 selection methods (M1_TRN_LOGISTIC_BACKWARD, M1_TRN_LOGISTIC_STEPWISE, M1_TRN_LOGISTIC_FORWARD; cells show # in model / p-value):

| Step | Effect Entered | Effect Removed | BACKWARD | STEPWISE | FORWARD |
| 1 | no_claims | | | 1 / 0.000 | 1 / 0.000 |
| 2 | member_duration | | | 2 / 0.000 | 2 / 0.000 |
| 3 | optom_presc | | | 3 / 0.000 | 3 / 0.000 |
| 4 | total_spend | | | 4 / 0.001 | 4 / 0.001 |
| 5 | doctor_visits | | | 5 / 0.019 | 5 / 0.019 |
| | | num_members | 5 / 0.7501 | | |
  • 84. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 84
  • 85. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 85 Doctor_visits Not removed even though insignificant.
  • 86. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 86
  • 91. 2019-06-20 91 GOF ranks by GOF measure (AUROC; Avg Square Error; Cum Lift 3rd bin; Cum Resp Rate 3rd bin; Gini; R-square Cramer-Tjur), with unweighted mean and median of the ranks.

TRN:
| Model | AUROC | ASE | Cum Lift 3rd | Cum Resp 3rd | Gini | R2 Cramer-Tjur | Unw. Mean | Unw. Median |
| 01_M1_TRN_LOGISTIC_BACKWARD | 1 | 1 | 1 | 1 | 2 | 2 | 1.33 | 1.00 |
| 02_M1_TRN_LOGISTIC_FORWARD | 2 | 2 | 2 | 2 | 3 | 3 | 2.33 | 2.00 |
| 03_M1_TRN_LOGISTIC_NONE | 3 | 3 | 3 | 3 | 1 | 1 | 2.33 | 3.00 |
| 04_M1_TRN_LOGISTIC_STEPWISE | 4 | 4 | 4 | 4 | 4 | 4 | 4.00 | 4.00 |

VAL:
| Model | AUROC | ASE | Cum Lift 3rd | Cum Resp 3rd | Gini | R2 Cramer-Tjur | Unw. Mean | Unw. Median |
| 05_M1_VAL_LOGISTIC_BACKWARD | 2 | 2 | 2 | 2 | 2 | 2 | 2.00 | 2.00 |
| 06_M1_VAL_LOGISTIC_FORWARD | 3 | 3 | 3 | 3 | 3 | 3 | 3.00 | 3.00 |
| 07_M1_VAL_LOGISTIC_NONE | 1 | 1 | 1 | 1 | 1 | 1 | 1.00 | 1.00 |
| 08_M1_VAL_LOGISTIC_STEPWISE | 4 | 4 | 4 | 4 | 4 | 4 | 4.00 | 4.00 |
  • 93. 2019-06-20 93 K-S point used as cutoff, mostly in finance.
  • 95. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 95
  • 96. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 96 Model performances are very similar, given the data set's simplicity. AUROC is equivalent across models and data roles. It is possible to add non-logistic techniques to obtain a wider notion of model performance, as in the chapter on trees and ensembles.
  • 97. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 97
  • 98. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 98
  • 99. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 Balanced / Unbalanced target – Rare Event. Typical situation: Binary dependent variable with far fewer ‘1’s than ‘0’s: fraud (less than 1%), extreme diseases, oil spills, wars, decision to run for office, defective products, etc. Logistic regression in this case, for instance, underestimates probability of rare events. Also, tendency to create enormous data bases to contain ‘rares’. Typically, misclassification cost of ‘1’s higher than misclassification cost of ‘0’s. Since classifiers typically aim at maximizing accuracy, 98% ‘0’s is already very good accuracy but it could lead to very poor ‘1’ accuracy. And same for ROC. 99
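The accuracy pitfall described above is easy to demonstrate (Python sketch with a hypothetical 1,000-case sample at a 2% event rate):

```python
# Hypothetical sample of 1,000 cases with a 2% event rate.
n, n_events = 1000, 20
labels = [1] * n_events + [0] * (n - n_events)
preds = [0] * n  # classifier that always predicts the majority class '0'

accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / n
event_recall = sum(int(p == 1 and y == 1)
                   for p, y in zip(preds, labels)) / n_events
```

The do-nothing classifier reaches 98% accuracy while capturing 0% of the events, which is why accuracy alone is a poor yardstick for rare-event problems.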
  • 100. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 Most used methods to deal with unbalanced samples for most classification methods: 1) under-sample '0's; 2) over-sample '1's (by re-sampling with replacement); 3) both; 4) SMOTE, which does 3) with special over-sampling of '1's; 5) use cost functions (usually difficult to establish). SMOTE: over-sample by creating synthetic observations from nearby '1' neighbors: multiply the difference between a chosen neighbor's values and the original's by a random value between 0 and 1, and add the result to the original value. Bing Zhu (Sichuan University), Bart Baesens (KU Leuven) & Seppe vanden Broucke (KU Leuven) (2017) argue that 50:50 resampling is not necessary when only focusing on GOF measures. 100
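SMOTE's synthetic-point construction, as described above, in a minimal Python sketch (neighbor selection is omitted; `smote_point` and `lam` are illustrative names, not the algorithm's official API):

```python
import random

def smote_point(x, neighbor, lam=None):
    # Synthetic minority observation: the original '1' case plus a random
    # fraction lam in [0, 1) of its difference to a nearby '1' neighbor.
    if lam is None:
        lam = random.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]
```

Each synthetic point lies on the segment between the original observation and its chosen minority neighbor, which is what distinguishes SMOTE from plain duplication by re-sampling.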
  • 101. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 101 Typically, correct classification of "rares" has greater value than that of "usuals". Is estimation affected? In the case of rare events, most applications yield smaller probabilities than they should; for Y=1 cases, probabilities should be larger (e.g., 0.8 instead of 0.6) → πi(1- πi) is smaller (.16 vs. .24) and the variance (its inverse) is higher for .8 than for .6 → additional '1's (which raise estimated probabilities) cause the variance to drop further, and thus '1's bring in 'more' information than '0's (King, Zeng 2001). PROS: Re-sampling is the prevailing methodology. CONS: Ad-hoc procedure, different opinions on the re-sampling mixture, no analysis of effects on coefficients, drops information.
  • 102. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 Balanced/Unbalanced target, comment. Re-balancing trees (Auslender, 1998, never finished): create samples with respective percentages of 0/1 equal to 45/55, 46/54, …, 54/46, 55/45. Typically observe that the upper set of levels is similar or the same for all samples (split values and variables) in tree classification (next chapter). The lower layer typically contains similar variables that are split, sometimes in different hierarchical order: variable 1 is split at level 4 in sample 45/55 and at level 5 in sample 50/50, while variable 2 behaves reciprocally. Conclusion: the top level is the core of the tree, and the middle level still provides strong information. After that, information is not reliable. A similar approach is possible for logistic regression in the context of co-linearity (co-linearity induces instability in the coefficients of linear models). But see Owen next. NB: Balanced/Unbalanced comparison examples in next lecture. 102
  • 103. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 103
  • 104. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 Owen's 2007 findings and recommendations for logistic. Owen (2007): infinitely imbalanced case, where N0 → ∞ and N1 is fixed. For a "given" model (i.e., not a variable-search model), the contribution of the X cases when Y = 1 depends entirely on the mean of X when Y = 1, through the relation

X̄1 = ∫ e^(βx) x dF0(x) / ∫ e^(βx) dF0(x),   F0: distr. of X given Y = 0.

Practical implication: for non-outlier data in X given Y = 1, the mixture of 0/1 does not particularly affect estimation, except the intercept, which → -∞. If X given Y = 1 is clustered, consider splitting the analysis by clusters. 104
  • 105. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 SEPARATION. Separation is observed in the fitting process of a logistic regression model if the likelihood converges to a finite value while at least one parameter estimate diverges to (plus or minus) infinity. Separation primarily occurred in small or sparse samples with highly predictive covariates for a fixed model in days of yore. 105
  • 106. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 More interesting case: groups.google.com/group/MedStats/browse thread/thread3078fd372b83f662

Obs  Y  x1  x2    S
  1  1  29  62   10
  2  1  30  83   29
  3  1  31  74   18
  4  1  31  88   32
  5  1  32  68   10
  6  2  29  41  -11
  7  2  30  44  -10
  8  2  31  21  -35
  9  2  32  50   -8

where S = x2 - 2*x1 + 6. Note the perfect separation of Y for S > -8. Also, S is obviously perfectly co-linear with x1 and x2. 106
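The nine observations above can be checked directly (Python sketch; only the arithmetic from the slide, no model fitting):

```python
# The nine observations from the slide, as (Y, x1, x2) tuples.
data = [
    (1, 29, 62), (1, 30, 83), (1, 31, 74), (1, 31, 88), (1, 32, 68),
    (2, 29, 41), (2, 30, 44), (2, 31, 21), (2, 32, 50),
]

def s_value(x1, x2):
    return x2 - 2 * x1 + 6

# Every Y=1 case has S >= 10 and every Y=2 case has S <= -8, so any
# threshold in between separates the groups perfectly; the logistic
# likelihood then has no finite maximizer for the coefficient on S.
separated = all((s_value(x1, x2) > 0) == (y == 1) for y, x1, x2 in data)
```

This is exactly the situation where a logistic fit reports huge coefficients and standard errors while the likelihood flattens out.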
  • 107. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 107 Canonical Discriminant Analysis. ***
  • 108. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 108 Canonical Discriminant Analysis (CDA). (seen in Cluster Analysis lecture) “Canonical” is the statistical term for analyzing latent variables (which are not directly observed) that represent multiple variables (which are directly observed). CDA is related to PCA and Canonical Correlation Analysis (CCA). Canonical Variate (CV) is weighted sum of the variables in the analysis. CCA is useful in analyzing strength of association between two constructs of continuous variables. CDA finds linear functions of variables that maximally separate the means of groups of observations into two or more groups (given by a nominal target variable), while maintaining variation within groups as small as possible. PCA summarizes total variation. # groups = # nominal levels.
  • 109. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 109 Canonical Discriminant Analysis (CDA). CDA is equivalent to canonical correlation analysis between the INTERVAL vars and a set of dummy variables coded from the TARGET. It produces 'k' Canonical Discriminant Functions (CDFs) or canonical variables, 'k' = min (# groups – 1, # variables). CDF_1 yields max variation between groups w.r.t. within-group variation → greatest degree of group differences. CDF_2, uncorrelated with CDF_1, captures group differences not captured by CDF_1. Previous clustering examples showed 2 CDFs, since the number of chosen clusters was 3.
  • 110. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 110
  • 111. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 111 Un-reviewed Additional Classification Methods. Linear Discriminant Analysis (especially for multi-categorical dependent variable) but requires normality as strong assumption. Also Quadratic DA, and Flexible DA. Naïve Bayes (also useful for continuous dependent variables). Support Vector Machines Neural Networks (also useful for continuous dep var) Clustering: Reviewed in earlier presentation, can be used as classification tool.
  • 112. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 112 ID3 and C4, C4.5, PART (briefly mentioned in Tree presentation). K-Nearest Neighbor Mixture discriminant analysis (and review in later chapters): Classification and Regression Trees (CART) Bagging CART Random Forest Gradient Boosting
  • 114. 2019-06-20 Basic logistic SAS program.

proc logistic data = &data.;
  model &depvar. (event = "&event.") = &outvars. / selection = none outroc = troc;
  score data = &validation. out = &val_out. outroc = valoutroc;
  roc;
  roccontrast;
run;

114
  • 115. 2019-06-20 SAS program: the ROC curve from proc logistic is in the outroc data set. For the AUROC, capture the Association table with ODS before the proc logistic call:

ods output Association = assoc_out;

After proc logistic runs:

proc sql noprint;
  select nValue2 into :auroc
  from assoc_out
  where upcase(Label2) = "C";
quit;
%put auroc is &auroc.;

115
  • 117. 2019-06-20 117 1) Explain, technically or non-technically, separation and Quasi-separation. Solutions for the problem, or is it not a problem? Does it apply to other supervised methods? Is it co-linearity? 2) Interactions in logistic as opposed to in linear regression? Do some research. 3) Unbalanced samples, problem in logistic? Does it make a difference whether you have a fixed or a searched model? 4) Accuracy vs. precision?
  • 118. 2019-06-20 118 References Hosmer D., Lemeshow S. (2000): Applied Logistic Regression, Wiley. Owen A. (2007): Infinitely Imbalanced Logistic Regression, JMLR
  • 119. 2019-06-20 Leonardo Auslender Copyright 2009 2019-06-20 The End?? 119