This document discusses binary classification and logistic regression. It defines key terms like odds, odds ratios, and probability. It also notes some issues that can arise, such as imbalanced samples, separation, and unclear practices regarding variable selection and dealing with collinearity. Interpretation of coefficients in logistic regression is discussed, specifically that coefficients estimate the log of the odds ratios comparing different levels of the independent variables.
2. 2019-06-20 Leonardo Auslender Copyright 2009
Contents
Present practice in classification methods and interpretation
of probability.
Logistic Regression.
Odds.
Coefficient interpretation and p-values.
Model Performance and Assessment.
Gains Charts, ROC.
Variable Selection.
Issues:
Balanced / unbalanced samples.
Separation.
Canonical Discriminant Analysis
3. 2019-06-20
Present Practice, issues and headaches.
Classification is the data mining area par excellence. We will focus on
binary targets of events/non-events. Research & applications in:
Clinical data analysis (disease / no-disease, with its insistence
on odds ratios and logistic regression),
Direct marketing: response/no-response, attrition, etc.
Recommender systems: interesting / uninteresting.
Fraud, terrorism, banking, etc.
Issues and Headaches (can’t cover them all in this lecture):
- Mixture of events/non-events in the training/estimation sample for a binary target.
- Obfuscating terminology.
- Model comparisons.
- Modeling methodologies confront unexpected issues: collinearity and separation in
logistic regression (and neural networks?), smooth vs. step response functions in trees, etc.
4. 2019-06-20
Present Practice, issues and headaches.
Unclear practices on the mixture of 0/1 in the target variable for estimation, and
on the relative costs of misclassifying 0s and 1s.
Obfuscating terminology around ROC, precision, model choice, etc.
Concepts used in classification methods to compare models are derived
from different methodologies (i.e., trees, neural networks, etc.); a mixture of
methods is used in practice.
Unclear practices on separation, collinearity and variable selection. Collinearity
is likely more of a bête noire than in the linear regression case, and adds
doubts about the stability of predicted probabilities when scoring future
databases.
All models produce predictions in probability or rank-probability form;
a decision has to be made about whether to use a cutoff point.
In the next pages: Events = “1”, non-Events = “0”
(usually 1, -1 in engineering and science applications).
5. 2019-06-20
Meaning of Probability Statements: context-dependent.
Probability is measured in the interval (0, 1), and methods are
“mechanical”, i.e., context is provided by the analyst. E.g.:
1. A model estimates that a household has a 70% probability of responding to a credit card
solicitation. The solicitation cost is minimal, and any bad feeling if a non-responding
customer didn’t want to be solicited is disregarded. Likely action: solicit.
2. A model estimates that the probability that the conference ceiling will fall on us right now is
40%. How many of you will stay until I finish reading this paragraph? Action: run for your
life? Sue the presenter?
3. DNA matching asserts that Prob(male A is father of baby) = 95%, i.e., 1 in 20 is a false
positive. Action: A is the father?
4. The probability of a devastating earthquake in NYC in the next 24 hours is negligible but still
non-zero. Nobody seems to care about this element of probability. Why? Would the same
apply in LA, with its higher probability?
The cost (profit) of implementing/not implementing a decision, even if not exactly
quantifiable, is most important in context. CONTEXT, CONTEXT, CONTEXT.
8. 2019-06-20
Binary Dependent Variable – Some Definitions.
1) Standard linear model Y* = Xβ + ε, but now Y is not continuous:
Y = 1 if Y* > k (Y* continuous), 0 otherwise; e.g., 'good' student if GPA > 3.
Usual assumptions hold and k is usually unknown. Ergo, estimate Y*?
Assume k = 0 below for ease of exposition.
Let π = Pr(Y = 1 | X = x) = Pr(Xβ + ε > 0) = Pr(ε > -Xβ) =
= 1 - F(-Xβ), where F is the CDF of ε.
Since E(ε) = 0, E(Y*) = Xβ, expanded in terms of Y as:
E(Y) = Σ πi yi = π(1) + (1 - π)(0) = π = Xβ.
Problem: if Y is 0/1, ε can take only two values, and therefore is
not normal (do we care?).
9. 2019-06-20
Binary Dependent Variable – Some Definitions to drop linear model.
If Yi = 1, εi = 1 - E(Yi) = 1 - Xiβ = 1 - πi, which occurs with probability πi;
if Yi = 0, εi = -πi, which occurs with probability (1 - πi): the error is not normally but
binomially distributed, and its variance is not constant (if π = .5, maximum
variance = .25; if π = .01 (in the tails), variance = .0099; as π → 1, variance
→ 0) ⇒ any predictor affecting the mean also affects the variance ⇒ the usual linear
model is not applicable because it assumes constant variance.
Obviously, π estimated from Y* above may lie outside [0, 1] ⇒ estimated variance < 0. One
could still rescale predicted values to lie within 0-1 (Foster & Stine, 2004), but
how to compare two models that predict different values > 1? (Rescaling to
(0, 1) is quite ad hoc.)
A more serious reason to drop the linear model:
Linear marginal effects: in the linear model, the effect of ΔX on prob(Y) is constant regardless
of the initial value of prob(Y). E.g., ΔX = 1 ⇒ Δprob(Y) = .1 whether prob(Y) =
.5 or prob(Y) = .93. Intuitively, Δprob(Y) should be larger for prob(Y) closer
to 0.5.
10. 2019-06-20
Contrived example to motivate marginal effects.
Suppose an urn with 50 red balls and 50 blue balls. The exercise is to determine
the percentage of red balls in the urn, and the ‘effort’ required to increase the
probability of obtaining a red ball.
“Effort” (i.e., the X variable) is the number of additional red balls to add to increase
the probability. Starting from the 50/50 mixture, X = 0 (i.e., not adding any red ball yet)
just gives 50%. To reach 51%, 52%, …, we need to solve for r in the following equation
(b remains at 50; r = # reds, b = # blues).
Let’s see how to raise the probability from .5 to .95 in jumps of .05 by raising r:
prob = r / (r + b)  ⇒  r = b · prob / (1 - prob)

Obs    b    prob        r
 1    50    0.50     50.000
 2    50    0.55     61.111
 3    50    0.60     75.000
 4    50    0.65     92.857
 5    50    0.70    116.667
 6    50    0.75    150.000
 7    50    0.80    200.000
 8    50    0.85    283.333
 9    50    0.90    450.000
10    50    0.95    950.000

Δprob / ΔX is not constant, as in linear regression: it takes an increasing
number of red balls to increase the probability.
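The urn table above can be reproduced in a few lines; a small Python sketch (the function name `reds_needed` is ours, not from the slides):

```python
# Number of red balls needed so that the probability of drawing red reaches p,
# with b = 50 blue balls fixed: prob = r / (r + b)  =>  r = b * p / (1 - p).

def reds_needed(p, b=50):
    """Red balls required for P(red) = p with b blue balls in the urn."""
    return b * p / (1 - p)

for p in [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]:
    print(f"p = {p:.2f}   r = {reds_needed(p):8.3f}")
```

The printout matches the table: the required "effort" r grows without bound as p approaches 1.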
11. 2019-06-20
Prob rises from .75 to .80 as reds rise from 150 to 200, but going from .90 to .95
requires raising the red balls from 450 to 950.
12. 2019-06-20
Binary Dependent Variable – Some Definitions.
For ordinal or nominal dependent variables, coding is arbitrary, and any
monotonic transformation of π gives different results.
We need a non-decreasing function mapping Xβ into the unit interval. The function
usually chosen is the CDF of the unit-normal distribution ⇒ the PROBIT method, or
the standardized logistic distribution ⇒ the linear logistic or logit model (more
convenient, no integrals; similar to a t-distribution with 7 dfs). Non-symmetric
alternatives: log-log and complementary log-log (cloglog, used in survival
and interval-censored models).
NB: Purpose is to model probability of occurrence of event.
13. 2019-06-20
Different link functions, with slope 3 (plotted in the next slide):
Logistic: P1 = 1 / (1 + exp(-X·3))
Probit: P1 = Φ(X·3·√3/π) (normal CDF, rescaled to be comparable to the logistic)
Linear: P1 = .5 + X/3
Complementary log-log (extreme value, Gompit; note its skewness):
P1 = 1 - exp(-exp(X·3))
Not shown: the log-log link, -log(-log(π)) (seldom used because of
inappropriate behavior for π < .5).
For which link is "best", see Koenker and Yoon
(2009, Journal of Econometrics).
15. 2019-06-20
Binary Dependent Variable, odds and log odds.
The logistic is mathematically easier than the probit.

π = e^(Xβ) / (1 + e^(Xβ)) = [1 + e^(-Xβ)]^(-1), the logistic CDF.

Interpretable as log-odds:
Odds = π / (1 - π) = e^(Xβ)
log(π / (1 - π)) = Xβ, the log-odds or logit.

Note that the logit (the ‘linear equation’ Xβ) estimates log-odds, not
probability. Log-odds are popular in gambling: e.g., odds of 19 to 1 that the ‘house’
will win (equivalent to a 95% probability of winning for the house).
π is a nonlinear function of β and requires an iterative method of estimation.
NB: Will skip individual parameter inference.
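The probability/odds/log-odds correspondences above are easy to check numerically; a quick Python sketch (function names are ours):

```python
import math

def odds(p):            # odds = p / (1 - p)
    return p / (1.0 - p)

def prob_from_odds(o):  # inverse: p = o / (1 + o)
    return o / (1.0 + o)

def logit(p):           # log-odds
    return math.log(odds(p))

def logistic(z):        # inverse of the logit: the logistic CDF
    return 1.0 / (1.0 + math.exp(-z))

# Gambling example from the slide: house odds of 19 to 1 <=> 95% probability.
print(odds(0.95))            # ~19
print(prob_from_odds(19.0))  # 0.95
print(logistic(logit(0.3)))  # round-trips back to 0.3
```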
17. 2019-06-20
Example of gambling odds. What is the meaning of 1 / p?
Especially in the case of rare events, that is, when the probability of an event is
low, the reciprocal of the probability, 1 / p, provides a ‘1 in n’ re-scaling that
can be informative. For instance, if prob(event) = 0.018, its reciprocal is
about 1 in 56 tries. If the probability distribution is geometric, then when p
= 0.018 you would need, on average, about 56 flips of a coin to obtain the desired event.
We can relate this topic to odds, which will be useful later on. If the odds of winning
a bet are 1.25, and you bet 8, then you get 8 · 1.25 = 10 back when you win
(including the original bet, for a total profit of 2), and nothing in the case of
a loss. The bet would be fair if the probability of winning were 8 / 10 = 0.8,
the reciprocal of which is 1.25.
In the context of survey sampling, the reciprocal of the probability of being
included is called the sampling weight.
24. 2019-06-20
Coefficient interpretation: Odds, Odds-ratios (non-interactive model).
Estimated model: log(odds) = a + b·X1 + c·X2  ⇒  Odds = e^(a + b·X1 + c·X2).
e^a = odds(Y) when X1 = 0 and X2 = 0.

Odds(Y | X1 = x1) = e^(a + b·x1 + c·X2)
Odds(Y | X1 = x1 + 1) = e^(a + b·(x1 + 1) + c·X2)

Odds Ratio(X1) = e^(a + b·(x1 + 1) + c·X2) / e^(a + b·x1 + c·X2) = e^b,
for X2 at any value.

E.g., if b = .04, e^b = 1.0408 ⇒ odds(Y) increase 4.08% for ΔX1 = 1.
In general, for ΔX1 = 1, odds(Y) change by (e^b - 1)·100%.
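The fact that the odds ratio equals e^b regardless of X2 (in the non-interactive model) can be verified numerically; a sketch with made-up coefficients:

```python
import math

# Hypothetical coefficients for log(odds) = a + b*X1 + c*X2 (no interaction).
a, b, c = -1.0, 0.04, 0.7

def odds(x1, x2):
    return math.exp(a + b * x1 + c * x2)

# The odds ratio for a one-unit increase in X1 equals e^b, whatever X2 is:
for x2 in (0.0, 3.0, -10.0):
    or_x1 = odds(5.0 + 1.0, x2) / odds(5.0, x2)
    print(round(or_x1, 6), round(math.exp(b), 6))  # ~1.0408: +4.08% odds
```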
25. 2019-06-20
Coeff. interpretation - dummy var – non-interaction model.
Y ∈ {0,1}, X ∈ {0,1};  logit(Y) = α + βX + ε.

Pr(Y=1 | X=1) = e^(α+β) / (1 + e^(α+β))  …  finding the odds:

Pr(Y=1 | X=1) / Pr(Y=0 | X=1) = Pr(Y=1 | X=1) / (1 - Pr(Y=1 | X=1)) = e^(α+β), the odds at X = 1.
Pr(Y=1 | X=0) / Pr(Y=0 | X=0) = e^α, the odds at X = 0.

Odds ratio: the ratio of odds = e^(α+β) / e^α = e^β:
β is the change in log-odds due to a one-unit change of X.

More generally, if X ∈ {r, s}: the odds at X = r are e^(α+rβ) (similarly for X = s),
and the odds ratio for an increase of X from r to s is {e^β}^(s-r).
26. 2019-06-20
Coefficient interpretation: Odds-ratios (interactive model).
Estimated model: log(odds) = a + b·X1 + c·X2 + d·X1·X2
⇒  Odds = e^(a + b·X1 + c·X2 + d·X1·X2).

Odds(Y | X1 = x1) = e^(a + b·x1 + c·X2 + d·x1·X2)
Odds(Y | X1 = x1 + 1) = e^(a + b·(x1 + 1) + c·X2 + d·(x1 + 1)·X2)

Odds Ratio(X1) = e^(a + b·(x1 + 1) + c·X2 + d·(x1 + 1)·X2) / e^(a + b·x1 + c·X2 + d·x1·X2)
= e^(b + d·X2)

⇒ e^b is NOT the odds ratio for X1 except when X2 = 0;
the odds ratio is conditional on the values of X2.
NB: If X2 is measured in deviation terms, say from its mean, e^b is the
odds ratio at X2 = mean.
27. 2019-06-20
Coefficient interpretation: Odds-ratios (interactive model).
Estimated model: log(odds) = a + b·X1 + c·X2 + d·X1·X2.

Odds Ratio(X1) = e^(a + b·(x1+1) + c·X2 + d·(x1+1)·X2) / e^(a + b·x1 + c·X2 + d·x1·X2) = e^(b + d·X2)
Odds Ratio(X2) = e^(a + b·X1 + c·(x2+1) + d·X1·(x2+1)) / e^(a + b·X1 + c·x2 + d·X1·x2) = e^(c + d·X1)
Odds Ratio(X1, X2) (both raised by 1) = e^(b + c + d·(X1 + X2 + 1))

⇒ e^d = OR(X1, X2) / [OR(X1) · OR(X2)]
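That the X1 odds ratio is conditional on X2 under interaction can again be checked numerically; a sketch with made-up coefficients:

```python
import math

# Hypothetical coefficients for log(odds) = a + b*X1 + c*X2 + d*X1*X2.
a, b, c, d = -1.0, 0.4, 0.2, 0.1

def odds(x1, x2):
    return math.exp(a + b * x1 + c * x2 + d * x1 * x2)

def or_x1(x2, x1=2.0):
    """Odds ratio for X1 -> X1 + 1, holding X2 fixed."""
    return odds(x1 + 1.0, x2) / odds(x1, x2)

# With interaction, the X1 odds ratio is e^(b + d*X2): it depends on X2.
print(or_x1(0.0), math.exp(b))          # equal only when X2 = 0
print(or_x1(5.0), math.exp(b + d * 5))  # different from e^b
```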
30. 2019-06-20
Interpretation.
Intercept: exp(-0.3390) = 0.71 is the baseline odds: the ratio of the probability
of fraud versus non-fraud when all predictors are zero, i.e., no
doctor visits, no prescriptions, zero member duration, etc.
Member duration: the odds-ratio point estimate of 0.993 is less than 1,
which means that the ratio of the probability of fraud versus non-fraud
decreases the longer the member duration. The odds ratio is constant as
long as there is no interaction term.
Number of claims: the odds ratio of 2.146 means that the ratio of the
probability of fraud versus non-fraud increases with the number of
claims, as long as there is no interaction term.
Etc.
35. 2019-06-20
Binary Dependent Variable – Log-likelihood. ***
Since observations are assumed independent, the joint probability is (take
logs because they are easier to work with):

Pr(Y1, …, Yn) = Π πi^Yi (1 - πi)^(1-Yi)

Taking logs (omitting the "i" subscript), and remembering that π = f(Data, β):

ln L(β, data) = Σ [ Y ln π + (1 - Y) ln(1 - π) ]
             = Σ [ Y ln(π / (1 - π)) + ln(1 - π) ]
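The two forms of the log-likelihood above are algebraically identical, which a short Python sketch (toy data, hypothetical fitted probabilities) can confirm:

```python
import math

def log_likelihood(y, pi):
    """Sum of y*ln(pi) + (1-y)*ln(1-pi) over observations."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
               for yi, p in zip(y, pi))

# Toy data with hypothetical fitted probabilities:
y  = [1, 0, 1, 1, 0]
pi = [0.8, 0.3, 0.6, 0.9, 0.2]

# Equivalent logit form: sum of y*ln(pi/(1-pi)) + ln(1-pi).
alt = sum(yi * math.log(p / (1 - p)) + math.log(1 - p) for yi, p in zip(y, pi))

print(log_likelihood(y, pi), alt)  # identical values
```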
36. 2019-06-20
Binary Dependent Variable – Model evaluation via inference.
Nested approach:
Does the model with added predictor(s) provide “significantly” more information
about the dependent variable than the model without them? Typically H0:
constrained model, H1: fuller model (NB: no notion of model fit as in the R2 of regression).
Compare two nested models:
Log(odds) = α + β1x1 + β2x2 + β3x3 + β4x4 (model 1) H1
Log(odds) = α + β1x1 + β2x2 (model 2) H0
Three tests:
– Likelihood ratio test (LRT): balanced between H0 and H1.
– Wald test: starts at H1 and asks whether the restriction toward H0 worsens the fit.
– Score or Lagrange multiplier (LM) test: starts at H0 and asks whether
movement towards H1 is an improvement. The 1st derivative of the likelihood
function is called the score function.
37. 2019-06-20
Model χ2 and Model Comparisons.
Likelihood-ratio test: used to contrast two models, one of which is a subset of the other
(i.e., some βj’s set to zero; called nested-model inference).
G0 = 2 (log L1 – log L0) ~ χ2 (k), k = number of coefficients set to 0 (called the
deviance, and sometimes residual deviance).
G0 is called -2 LLR or 2 LLR (depending on the order of the likelihoods).
Bit of confusion: -2 LLR is also a deviance, but specific to the model vs. the saturated
model, and the saturated model has log-likelihood 0.
Is likelihood the same as probability?
In the case of probability, we know the pdf (probability distribution function) and the
parameter values, and want to know the probability of specific event(s). I.e., we know the betas.
In the case of likelihood, given specific event(s), we estimate the pdf and/or the model
parameters. That’s why, given data, we want to estimate the parameters that maximize the
probability of the event(s) reflected by the data, for an assumed pdf. We don’t know but
estimate the betas.
38. 2019-06-20
Score (LM) Test.
Let U(β) be the gradient (score) vector d lnL / dβ.
Let H(β) be the Hessian matrix d² lnL / dβ².
Let I(β) = -H(β), or the expected value of -H(β) (the information matrix).
Let β0 be the MLE under H0.
Score statistic: U'(β0) I(β0)⁻¹ U(β0) ~ χ2(r), r = number of restrictions.
39. 2019-06-20
Example: Y ~ B(n, θ) ***
H0: θ = 0.5, H1: θ = 0.2. Wald is based on the distance (0.5 – 0.2), the LRT on the
vertical distance between log-likelihoods, the LM on the slope of the log-likelihood
at θ = 0.2. When the log-likelihood is a smooth curve, all tests yield similar answers.
[Figure: log-likelihood plotted against θ in (0, 1); the Wald statistic is the horizontal
distance between 0.2 and 0.5, half the deviance is the vertical distance, and the
score test is the slope at θ = 0.2.]
40. 2019-06-20
Non-nested Model Comparisons:
Akaike Information Criterion (AIC) = deviance (= -2 LLR) + 2k,
k = # of model parameters.
BIC (Bayesian Information Criterion), or SC = deviance + k log n.
In both cases, the smaller, the better the model.
NOTE: in non-parametric models, such as tree-based models, ‘k’
is not fully defined ⇒ no fully accepted AIC or BIC.
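A quick sketch of the two criteria, with made-up deviances for two hypothetical competing models; note how BIC penalizes the larger model more heavily:

```python
import math

def aic(deviance, k):
    """AIC = deviance (-2 log L) + 2k, k = number of parameters."""
    return deviance + 2 * k

def bic(deviance, k, n):
    """BIC/SC = deviance + k * log(n)."""
    return deviance + k * math.log(n)

# Hypothetical competing (non-nested) models on n = 1000 observations:
m1 = {"deviance": 820.0, "k": 5}
m2 = {"deviance": 806.0, "k": 9}
for m in (m1, m2):
    print(aic(m["deviance"], m["k"]), bic(m["deviance"], m["k"], 1000))
# Smaller is better: here AIC prefers m2, while BIC's heavier
# penalty (log 1000 > 2) flips the choice to the smaller m1.
```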
41. 2019-06-20
Somers’ D and Pairs Concordance.
P = # of possible pairs of observations with different values of Y = n0·n1.
% Concordant: proportion of pairs in which the predicted Prob(Y = 1) is higher for
the “1” observation than for the “0” observation: nc / P.
% Discordant: proportion of pairs in which the predicted Prob(Y = 1) is lower for
the “1” observation than for the “0” observation: nd / P.
% Tied: proportion of pairs neither concordant nor discordant: nt / P.
Somers’ D = (nc – nd) / (nc + nd + nt); ranges from -1 (all pairs discordant)
to 1 (all pairs concordant).
Gamma = (nc – nd) / (nc + nd)
Tau-a = (nc – nd) / P
c = .5 (1 + Somers’ D), more popularly known as the ROC area (reviewed below).
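The pair-counting definitions above translate directly into code; a brute-force Python sketch over a toy set of predicted probabilities:

```python
def association(y, p):
    """Concordant/discordant/tied counts over all (event, non-event) pairs."""
    events    = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    nc = nd = nt = 0
    for p1 in events:
        for p0 in nonevents:
            if   p1 > p0: nc += 1
            elif p1 < p0: nd += 1
            else:         nt += 1
    pairs = len(events) * len(nonevents)   # P = n1 * n0
    somers_d = (nc - nd) / (nc + nd + nt)
    c = 0.5 * (1 + somers_d)               # = area under the ROC curve
    gamma = (nc - nd) / (nc + nd)
    return nc, nd, nt, somers_d, gamma, c

y = [1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.4]
print(association(y, p))  # nc=10, nd=1, nt=1, D=.75, c=.875
```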
42. 2019-06-20
Hosmer-Lemeshow fit (2000, p. 148).
Posterior probabilities in logistic regression are calculated from
covariate patterns. In data mining, most patterns have very few
observations, and many potential patterns are empty.
Assume 5 binary predictors in the model ⇒ maximum number
of patterns = 2 ^ 5 = 32. Assume only J = 8 patterns exist in the
data ⇒ expected values will be small or 0 in most cells. In data
mining, typically n = J. With continuous predictors, the number of
patterns is much larger.
43. 2019-06-20
Hosmer-Lemeshow “decile of risk” fit (2000, p. 148).
Proposal: “create patterns” by grouping by percentiles of the posterior
predicted distribution. By simulation, if g = 10 percentiles (deciles) are
chosen, the statistic is distributed as chi-square (g – 2) when J = n. Use
Pearson’s χ2 test; small p-values indicate lack of fit.
Problem: the Hosmer-Lemeshow test is known to fail with
continuous covariates due to the large number of potential
patterns.
Besides, it is known to produce misleading results due to ties or the order of
observations (Bertolini et al., 2000). The test is considered obsolete (Hosmer et al.,
1997).
DO NOT USE IT.
44. 2019-06-20
Cox and Snell (1989): R² = 1 - [L(0) / L(β̂)]^(2/N), L(0): likelihood of the
intercept-only model.
Nagelkerke (1991), max-rescaled R²: R² / Max R², where Max R² = 1 - [L(0)]^(2/N).
(Obtained via the RSQUARE option in SAS PROC LOGISTIC,
called RSquare and Max-rescaled RSquare respectively.)
Brier score: (1/n) Σ (P̂i - Yi)².
What would statisticians do without R2? Create more …
46. 2019-06-20
(1) Significant LRT ⇒ at least one beta is significantly different
from 0. Analogous to the F-test in linear regression.
Agreement with (2) and (3); if there is no agreement, the model is in
doubt.
(2) (4) and (5) confirm that the intercept-only model is inferior.
(3) In (6), the difference between the intercept-only model and the
intercept-and-covariates model yields (1).
47. 2019-06-20
Association of Predicted Probabilities and Observed Responses
Percent Concordant 75.4 Somers' D 0.512
Percent Discordant 24.2 Gamma 0.515
Percent Tied 0.4 Tau-a 0.160
Pairs 870504 c 0.756
Partition for the Hosmer and Lemeshow Test
FRAUD = 1 FRAUD = 0
Group Total Observed Expected Observed Expected
1 237 13 10.71 224 226.29
2 237 18 17.08 219 219.92
3 237 15 21.88 222 215.12
4 237 25 26.73 212 210.27
5 237 34 31.53 203 205.47
6 237 36 37.13 201 199.87
7 237 39 43.19 198 193.81
8 237 53 52.97 184 184.03
9 237 84 71.69 153 165.31
10 232 139 143.09 93 88.91
51. 2019-06-20
Misclassification Rates (typically assumes 0.5 cutoff)
Confusion Table.
0: “Negative” (non-event); “1”: “Positive” (event).
TN: True Negative; FN: False Negative; FP: False Positive; TP: True Positive.
“True” and “False” indicate whether the prediction matches the actual state of
nature; “Negative” and “Positive” refer to the prediction.

Actual \ Predicted      0      1
0 (non-event)          TN     FP
1 (event)              FN     TP
52. 2019-06-20
Misclassification Rates, K-S, ROC (cont. 1).
1) Classification accuracy or overall classification rate (1 –
misclassification rate), plus “0” and “1” classification rates:
Overall: (TP + TN) / (TP + TN + FP + FN).
Assumes a known and unchanging “natural” class distribution, and that the
error cost of a FP equals that of a FN. Typically favors the majority class; but
in most applications, the cost of misclassifying a “1” is higher.
“0” classification rate: TN / (TN + FP)
“1” classification rate: TP / (TP + FN)
Misleading results: assume “1” is the important class. Overall (left model) = 92.5%,
overall (right model) = 97.5%, but the right model misses all “1”s.

              Left model                Right model
           Predicted 0  Predicted 1  Predicted 0  Predicted 1
Actual 0       180          15           195           0
Actual 1         0           5             5           0
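The accuracy paradox in the two confusion matrices above can be verified in a few lines:

```python
def rates(tn, fp, fn, tp):
    """Overall accuracy plus the '0' and '1' classification rates."""
    overall = (tp + tn) / (tp + tn + fp + fn)
    rate0 = tn / (tn + fp)   # "0" classification rate (specificity)
    rate1 = tp / (tp + fn)   # "1" classification rate (sensitivity)
    return overall, rate0, rate1

# Left model:  TN=180, FP=15, FN=0, TP=5.  Right model: TN=195, FP=0, FN=5, TP=0.
print(rates(180, 15, 0, 5))   # overall 92.5%, catches all "1"s
print(rates(195, 0, 5, 0))    # overall 97.5%, but misses every "1"
```

The "better" overall accuracy on the right comes entirely from the majority class.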
53. 2019-06-20
Misclassification Rates, K-S, ROC
2) Event (“1”), non-event (“0”) and overall precisions:
PR(1) = TP / (TP + FP)  Event precision
PR(0) = TN / (TN + FN)  Non-event precision
PR = (PR(1) + PR(0)) / 2  (theoretical max for PR(i) = 1).
(Of those predicted as “1”, the proportion of “true” ones …)
3) Area under the Receiver Operating Characteristic curve
(AUROC): graph of the FP rate (x-axis) vs. the TP rate (y-axis) as the
classification threshold varies. The Laplace estimate is used (in tree algorithms)
because it yields more consistent improvements in ROC curves.
TP rate = TP / (TP + FN)  (sensitivity)
FP rate = FP / (FP + TN)  (1 – specificity)
TN / (FP + TN)  specificity
54. 2019-06-20
Recall, F1, …
Recall / sensitivity / TPR: TP / (TP + FN). Percentage of
events well classified.
F1-score / F-measure / F-score: harmonic mean of precision
and recall. Requires that recall and precision not move opposite to each
other. Useful when the costs of misclassifying 0s and
1s are very different. If costs are similar and the counts of FN ≈ FP,
accuracy can be used. The higher the F1, the better. A weighted F1-score
can also be used to give more importance to precision or to recall.

F1 = 2 · (Recall · Precision) / (Recall + Precision)

Specificity / TNR: TN / (TN + FP)
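A minimal sketch of these metrics, applied to the left model of the earlier accuracy example (TN=180, FP=15, FN=0, TP=5):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Left model from the accuracy example: overall accuracy was 92.5%,
# yet precision on the "1"s is only 25%.
print(precision_recall_f1(tp=5, fp=15, fn=0))  # (0.25, 1.0, 0.4)
```

Note how a model with perfect recall can still have a low F1 when precision is poor, which overall accuracy hides.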
55. 2019-06-20
Which measure to use?
If costs are similar, the counts of FN and FP are similar, and recall
does not move opposite precision, use F1; also when the event
prior is much smaller.
Choose recall if a FN is more important than a FP. E.g.,
misdiagnosing a cancer patient as healthy.
Choose precision when you want the predicted positives to be
truly positive. E.g., spam: prefer to miss a spam message rather
than send a good message to the spam folder.
Choose specificity to concentrate on non-events and avoid
false alarms (FP). E.g., a trial verdict.
61. 2019-06-20
Misclassification Rates, K-S, ROC
Heuristically, AUROC = the proportion of pairs of observations, one ‘0’ and one ‘1’,
in which the posterior Prob(Y = 1) of the ‘1’ exceeds that of the ‘0’. Max value is 1.
Classification methods typically maximize classification rates ⇒
a tendency to focus just on these rates, which does not necessarily
provide a good balance between events and non-events.
ROC curves can cross, e.g., for data
sets with different proportions of
0/1. Curves toward the NW corner indicate a better model.
B is preferred to C for low FP
rates, but C is preferred over B later
on. A is clearly inferior.
“Best” model: max AUROC. But the
max-AUROC model may not be ‘best’ for a
specific cost and class
distribution. Plus, it is based on
classification and not precision
measures (detailed below).
62. 2019-06-20
Misclassification Rates, K-S, ROC
ROC:
1) a step function built by varying the classification threshold;
different algorithms exist to ‘smooth’ the curves.
2) invariant to any monotonic transformation of the posterior
probabilities, because only the rankings matter.
63. 2019-06-20
AUROC comparison: Gradient Boosting is better than the logistic
for both TRN and VAL.
68. 2019-06-20
Ideally, the curve of the better model reaches toward the NE corner, and Gradient
Boosting arches NE more than the logistic; this is equivalent to building the PR-AUC
for each and comparing areas (not shown). GB better than LG. Only TRN shown.
69. 2019-06-20
Standard recommendation
For balanced data (i.e., a prior of about 50% and equal costs),
choose the most outward ROC curve as the model of choice.
Otherwise, use the precision-recall curve. Notice that non-events
mostly affect the ROC curve.
NOTE: we included Gradient Boosting (reviewed in a later
chapter) to emphasize model comparisons.
71. 2019-06-20
Explanation via example: Experian – white paper.
Lift or gains chart for a credit promotion. The table can be created by deciles (10%
intervals) or vingtiles (5% intervals) of the posterior probability of being “good”, in
descending order of probability. More often, binned by equal-sized bins of observations in
descending order of probability. The max potential lift is 1 / (event prior percentage).
When comparing different models by deciles, note that the underlying probabilities can
be different across models (e.g., logistic vs. Gradient Boosting).
Cum Good % (Bads) measures the cumulative % of all Goods (Bads) captured up to the
corresponding vingtile. % Cum Diff is the corresponding difference between goods and
bads, and its largest value is the K-S statistic.
Rule of thumb: a good model should capture at least 70% of the goods by the 5th decile /
10th vingtile.
Lift: ratio of the % of events in a decile/vingtile to the overall event rate (# responders in
the decile/vingtile / total # responders). Cum lift: the corresponding cumulative version.
K-S (Kolmogorov-Smirnov): the probability point at which [cum % captured events -
cum % captured non-events] is highest. Frequently used in finance.
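A simplified gains/K-S computation, on toy scores (the data and function name are ours; a real gains chart would also report lift per bin):

```python
def gains(y, p, bins=10):
    """Cumulative % of events and non-events captured, binning observations
    by descending predicted probability; K-S = max difference."""
    order = sorted(zip(p, y), reverse=True)
    n1 = sum(y)
    n0 = len(y) - n1
    size = len(y) // bins
    cum1 = cum0 = 0
    rows, ks = [], 0.0
    for b in range(bins):
        chunk = order[b*size:(b+1)*size] if b < bins - 1 else order[(bins-1)*size:]
        hits = sum(yi for _, yi in chunk)
        cum1 += hits
        cum0 += len(chunk) - hits
        diff = cum1 / n1 - cum0 / n0      # K-S is the max of this difference
        ks = max(ks, diff)
        rows.append((b + 1, round(cum1 / n1, 3), round(cum0 / n0, 3), round(diff, 3)))
    return rows, ks

# Toy scores: events concentrated at the top probabilities.
y = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
p = [0.97, 0.95, 0.90, 0.88, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60,
     0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]
rows, ks = gains(y, p)
for r in rows:
    print(r)
print("K-S =", round(ks, 3))
```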
76. 2019-06-20
Selecting the cutoff point (remember ‘k’ from the first pages?)
Many criteria, but BRAINS are most important:
K-S (Kolmogorov-Smirnov): the probability point at which events and non-events are most
separated cumulatively.
Cumulative lift: the probability point for a selected cum-lift value (arbitrary).
Profit/cost business decision: the probability point at which profit is maximized / cost is
minimized.
Many others.
BUT:
1) Consider the costs of the decision: a mail piece to the wrong person, or the wrong HIV
treatment?
2) Are the categories under study discrete, or were they derived from a
continuous scale, such as defining default as a 3-month-late payment? If
so, shouldn’t the decision take into consideration how late the payment was?
3) If probabilities are too close to the decision point, shouldn’t we get more
data to decide?
78. 2019-06-20
Lines at 0.3 and 0.207 are different cutoffs. Note that the proportion of true
events is higher than the probability indicated by the logistic at about the 0.3 point. ***
80. 2019-06-20
Forward selection for logistic regression (Hosmer and Lemeshow, 2000)
Assume ‘p’ available predictors. At step 0 (no variables yet entered), fit the “intercept-
only model” and denote its log-likelihood by L0. Fit ‘p’ univariate logistic regressions,
one per predictor, and obtain ‘p’ log-likelihoods, L1, …, Lp.
Calculate ‘p’ likelihood ratio tests, Gj = -2 (L0 – Lj), j = 1…p, and obtain the
corresponding p-values pj, such that Prob[χ2(ν) ≥ Gj] = pj, where ν = 1 if Xj is
continuous, and ν = k – 1 if Xj has k categories.
Choose the predictor Xj with the minimum p-value below the entry ‘alpha’ level, which
usually ranges between .15 and .2. Call this predictor Xj0.
In the next step, the logistic regression with Xj0 takes the previous role of the intercept-only
regression, and a new search is started with the p - 1 remaining candidate predictors.
The search stops when no p-value is less than the entry alpha level. While α = 5% is
embedded in many practitioners’ activities, it is known that in variable selection that
level is too stringent. It is usually recommended that the “α” level for entry be in the
range of 15% to 20%, or even higher depending on the breadth or exploratory
nature of the study.
Notice that, at least in Hosmer and Lemeshow (2000), the remaining predictors are not
orthogonalized relative to the one just selected (as in variable selection for linear
regression; orthogonalization is obtained by partialing).
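The first forward-selection step can be sketched as follows; the log-likelihood values are made up (in practice they come from fitting the actual logistic regressions), and the chi-square(1) tail probability uses the identity p = erfc(√(G/2)):

```python
import math

def chi2_1_pvalue(g):
    """P[chi-square(1) >= g], via the normal tail: p = erfc(sqrt(g/2))."""
    return math.erfc(math.sqrt(g / 2.0))

# Hypothetical log-likelihoods: intercept-only model and p univariate fits.
L0 = -350.0
L = {"x1": -349.2, "x2": -339.5, "x3": -344.0}   # made-up values

alpha_enter = 0.15
pvals = {}
for name, Lj in L.items():
    G = -2.0 * (L0 - Lj)          # LRT statistic, 1 df (continuous Xj)
    pvals[name] = chi2_1_pvalue(G)

best = min(pvals, key=pvals.get)
if pvals[best] < alpha_enter:
    print("enter", best, round(pvals[best], 8))
```

Here x2 enters first; x1's p-value exceeds the entry alpha and would never enter on its own.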
81. 2019-06-20
Stepwise Selection for logistic regression.
1. We proceed in a fashion similar to forward selection in steps 1 and 2, and assume
that we have selected X1 and X2 so far. At this moment, we perform a backward
selection step to check whether we should retain the variables selected earlier. We
accomplish this by creating a model with just X1, obtaining its log-likelihood
and by LRT comparing it to the full model log-likelihood.
2. In the general case of k ‘entered’ variables, the candidate to remove is that
which yields the highest LRT p-value when removed. The comparison is done
against the α-to-remove level that marks the threshold level of explanatory power.
If the highest p-value is higher than the α-to-remove, the variable is removed;
otherwise, it stays in the model.
3. The α-to-remove level must be higher than the α-to-enter to prevent cycles of
the same variable entering and leaving the model. Higher α levels, both to enter
and to be removed, allow for more variables to remain in the model.
4. The search continues adding and removing variables until either a) all ‘p’ candidate
variables have been entered; or b) all the variables in the model have p-values to
remove that are less than the α-to-remove level, and the variables not selected
have p-values higher than the α-to-enter level.
96. 2019-06-20
Model performances are very similar, given the data set's
simplicity. AUROC is equivalent across models and data roles.
It is possible to add non-logistic techniques to obtain a wider
notion of model performance, in the chapter on trees and
ensembles.
99. 2019-06-20
Balanced / unbalanced target – rare events.
Typical situation: a binary dependent variable with far fewer ‘1’s
than ‘0’s: fraud (less than 1%), extreme diseases, oil spills,
wars, the decision to run for office, defective products, etc. Logistic
regression in this case, for instance, underestimates the probability
of rare events. There is also a tendency to create enormous databases
to contain the ‘rares’.
Typically, the misclassification cost of ‘1’s is higher than the
misclassification cost of ‘0’s. Since classifiers typically aim
at maximizing accuracy, 98% ‘0’s already yields very good
accuracy, but it could lead to very poor ‘1’ accuracy. The
same goes for ROC.
100. 2019-06-20
Most used methods to deal with unbalanced samples, for
most classification methods:
1) under-sample the ‘0’s;
2) over-sample the ‘1’s (by re-sampling with replacement); or
3) both;
4) SMOTE, which does 3) with special over-sampling of the ‘1’s;
5) use cost functions (usually difficult to establish).
SMOTE: over-sample by creating synthetic observations near existing ‘1’s
(multiply the difference between a chosen near ‘1’ neighbor’s values and the
original’s by a random value between 0 and 1, and add it to the original value).
Bing Zhu (Sichuan University), Bart Baesens (KU Leuven) & Seppe
vanden Broucke (KU Leuven) (2017) argue that 50:50 resampling is not
necessary when only focusing on goodness-of-fit measures.
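The SMOTE interpolation step described above can be sketched in a few lines (toy minority points; real SMOTE picks among the k nearest minority neighbors rather than a fixed one):

```python
import random

def smote_point(x, neighbor, rng):
    """Synthetic '1': x + u * (neighbor - x), u ~ Uniform(0, 1)."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(7)
minority = [[1.0, 2.0], [1.2, 2.5], [0.9, 1.8]]

# Interpolate the first minority point toward a (here: arbitrarily chosen)
# near minority neighbor; each synthetic point lies on the connecting segment.
synthetic = [smote_point(minority[0], minority[1], rng) for _ in range(3)]
print(synthetic)
```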
101. 2019-06-20
Typically, correct classification of the “rares” has greater value than that
of the “usuals”.
Is estimation affected?
In the case of rare events, most applications yield smaller probabilities
than they should; for Y = 1 cases, probabilities should be larger (i.e.,
0.8 instead of 0.6). π(1 - π) is then smaller (.16 vs. .24), and the estimation
variance (its inverse) is higher at .8 than at .6 ⇒ additional ‘1’s (which push
the probability higher) cause the variance to drop further, and thus ‘1’s bring
in ‘more’ information than ‘0’s (King, Zeng 2001).
PROS: re-sampling is the prevailing methodology.
CONS: ad-hoc procedure, different opinions on the re-sampling mixture,
no analysis of the effects on coefficients, drops information.
102. 2019-06-20
Balanced/unbalanced target, comment.
Re-balancing trees (Auslender, 1998, never finished): create samples with
respective 0/1 percentages of 45/55, 46/54, …, 54/46, 55/45. One typically
observes that the upper set of levels is similar or the same for all samples (split values
and variables) in tree classification (next chapter).
The lower layer typically contains similar variables that are split, sometimes in a
different hierarchical order: variable 1 is split at level 4 in sample 45/55, and at
level 5 in sample 50/50, while variable 2 behaves reciprocally.
Conclusion: the top level is the core of the tree, and the middle level still provides strong
information. After that, the information is not reliable.
A similar approach is possible for logistic regression in the context of collinearity
(collinearity induces instability in the coefficients of linear models). But see
Owen next.
NB: balanced/unbalanced comparison examples in the next lecture.
104. 2019-06-20
Owen’s 2007 findings and recommendations for logistic regression.
Owen (2007) studies infinitely imbalanced cases, where N0 → ∞ and N1 is fixed. For a
“given” model (i.e., not a variable-search model), the contribution of the X cases
when Y = 1 depends entirely on the mean of X when Y = 1, by the relation below.
Practical implication: for non-outlier data in X | Y = 1, the mixture
of 0/1 does not particularly affect estimation, except for the intercept,
which → -∞. If X | Y = 1 is clustered, consider splitting the analysis by
clusters.
X̄1 = ∫ e^(xβ) x dF0(x) / ∫ e^(xβ) dF0(x),
F0: distribution of X given Y = 0; X̄1: mean of X given Y = 1.
105. 2019-06-20
SEPARATION.
Separation is observed in the fitting process of a logistic regression
model when the likelihood converges to a finite value while at least one
parameter estimate diverges to (plus or minus) infinity.
In days of yore, separation primarily occurred in small or sparse samples
with highly predictive covariates for a fixed model.
106. 2019-06-20
A more interesting case:
groups.google.com/group/MedStats/browse thread/thread3078fd372b83f662
Obs Y x1 x2 S
1 1 29 62 10
2 1 30 83 29
3 1 31 74 18
4 1 31 88 32
5 1 32 68 10
6 2 29 41 -11
7 2 30 44 -10
8 2 31 21 -35
9 2 32 50 -8
where S = x2 - 2·x1 + 6. Note the perfect separation of Y by S:
S > -8 exactly for Y = 1. Also, S is obviously perfectly collinear
with x1 and x2.
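The separation in this example can be checked numerically with a short Python sketch over the table above:

```python
# Data from the slide; S = x2 - 2*x1 + 6 separates Y perfectly.
rows = [  # (Y, x1, x2)
    (1, 29, 62), (1, 30, 83), (1, 31, 74), (1, 31, 88), (1, 32, 68),
    (2, 29, 41), (2, 30, 44), (2, 31, 21), (2, 32, 50),
]

def s(x1, x2):
    return x2 - 2 * x1 + 6

s1 = [s(x1, x2) for y, x1, x2 in rows if y == 1]
s2 = [s(x1, x2) for y, x1, x2 in rows if y == 2]
print(s1)                  # matches the slide's S column for Y = 1
print(s2)                  # and for Y = 2
print(min(s1) > max(s2))   # any cutoff in (-8, 10) separates perfectly
```

With such data, a logistic regression on S (or on x1 and x2) has no finite MLE: the slope estimate diverges.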
108. 2019-06-20
Canonical Discriminant Analysis (CDA).
“Canonical” is the statistical term for analyzing latent variables (which are not
directly observed) that represent multiple variables (which are directly
observed) (seen in the Cluster Analysis lecture). CDA is related to PCA and
Canonical Correlation Analysis (CCA).
A canonical variate (CV) is a weighted sum of the variables in the
analysis. CCA is useful in analyzing the strength of association
between two constructs of continuous variables.
CDA finds linear functions of the variables that maximally separate the
means of two or more groups of observations (given by a nominal target
variable), while keeping within-group variation as small as possible.
PCA, by contrast, summarizes total variation. # of groups = # of nominal levels.
109. 2019-06-20
Canonical Discriminant Analysis (CDA).
CDA is equivalent to canonical correlation analysis between the
INTERVAL vars and a set of dummy variables coded from the
TARGET. It produces ‘k’ canonical discriminant functions
(CDFs), or canonical variables, with k = min(# groups – 1, #
variables).
CDF_1 yields the maximum between-group variation relative to the
within-group variation ⇒ the greatest degree of group differences.
CDF_2, uncorrelated with CDF_1, captures group differences not
captured by CDF_1.
The earlier clustering examples showed 2 CDFs, since the number
of chosen clusters was 3.
111. 2019-06-20
Un-reviewed Additional Classification Methods.
Linear Discriminant Analysis (especially for a multi-categorical
dependent variable), but it requires normality, a strong assumption.
Also Quadratic DA and Flexible DA.
Naïve Bayes (also useful for continuous dependent variables).
Support Vector Machines.
Neural Networks (also useful for continuous dependent variables).
Clustering: reviewed in an earlier presentation; can be used as a
classification tool.
112. 2019-06-20
ID3 and C4, C4.5, PART (briefly mentioned in Tree presentation).
K-Nearest Neighbor
Mixture discriminant analysis
(and review in later chapters):
Classification and Regression Trees (CART)
Bagging CART
Random Forest
Gradient Boosting
114. 2019-06-20
Basic logistic SAS program.

proc logistic data = &data.;
   model &depvar. (event = "&event.") = &outvars. / selection = none outroc = troc;
   score data = &validation. out = &val_out. outroc = valoutroc;
   roc; roccontrast;
run;
115. 2019-06-20
SAS program:
For the ROC curve, obtained from proc logistic, see the outroc data set.
For the AUROC, before the "proc logistic" call use:

ods output Association = assoc_out;

After proc logistic has run:

proc sql noprint;
   select nvalue2 into :auroc from assoc_out where upcase(label2) = "C";
quit;
%put auroc is &auroc.;
117. 2019-06-20
1) Explain, technically or non-technically, separation and
Quasi-separation. Solutions for the problem, or is it not a
problem? Does it apply to other supervised methods? Is it
co-linearity?
2) Interactions in logistic as opposed to in linear
regression? Do some research.
3) Unbalanced samples, problem in logistic? Does it make a
difference whether you have a fixed or a searched model?
4) Accuracy vs. precision?