2. Introduction
• The most common regression solution for a dichotomous (two-outcome) target is Binary Logistic Regression (BLR). This has long been the go-to technique for modeling outcomes in many industries, including insurance, financial services and catalogue marketing. The technique is widely used to predict outcomes such as:
Response to a mailing (yes or no)
Chargeback (yes or no)
Chargeoff (yes or no)
Attrition (yes or no)
• However, there are many other situations where the dependent variable has more than two possible outcomes. For example, in telecommunications we may need to know which handset a prospect is most likely to choose from among several models. A classical ‘choice’ problem.
• In the case of customer attrition, we may be interested not only in whether the customer attrites, but also in whether they do so after 6 months, 1 year or 18 months of membership. A classical ‘conditional/survival probability’ problem.
• These types of problems have traditionally been approached by building several
separate binary models, one for each product/category/outcome.
• One of the problems with this approach is that the probabilities from these models are not necessarily comparable. If a prospect ranks highly in several of the models, how do we determine which product to market to them?
• Another approach has been Discriminant Analysis, but these techniques are quite unforgiving when their statistical assumptions (multivariate normality, equality of variances, etc.) are unmet.
• In this article, I recommend the use of a specialized form of logistic regression called
Multinomial Logit or Generalized Logistic Regression.
3. Generalized Logit Regression
• PROC LOGISTIC in SAS historically handled only binary outcome problems. Beginning with Version 8.2, the procedure was extended to polytomous dependent variables via Generalized Logit analysis using the LINK=glogit option.
• For a dependent variable with three possible outcomes A, B or C, the procedure will perform variable selection and create two linear equations to calculate logits. No logit is produced for the third category because it serves as the ‘reference’ category.
For a target (Y) that takes the values A, B or C, we can use Y=C as the reference category
and run the following SAS code:
PROC LOGISTIC DATA=yourdata;
  MODEL Y(ref='C') = Var1 Var2 Var3 /
    SELECTION=stepwise LINK=glogit;
RUN;
The procedure will estimate the following logits from the 3 independent variables (var1,
var2, var3):
Logit_A = intercept_A + β11·Var1 + β12·Var2 + β13·Var3
Logit_B = intercept_B + β21·Var1 + β22·Var2 + β23·Var3
Where (each logit has its own intercept and coefficients):
Logit_A = log(P(Y=A)/P(Y=C))
Logit_B = log(P(Y=B)/P(Y=C))
– The predicted probability for each category can be derived from these logits as:
• P(Y=C) = 1 / (1 + exp(Logit_A) + exp(Logit_B))
• P(Y=A) = exp(Logit_A) / (1 + exp(Logit_A) + exp(Logit_B))
• P(Y=B) = 1 - P(Y=A) - P(Y=C)
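As a sanity check, the logit-to-probability conversion can be sketched in Python rather than SAS (the function name glogit_probs is ours, for illustration only):

```python
import math

def glogit_probs(logit_a, logit_b):
    """Convert the two generalized logits (reference category C)
    into the three class probabilities."""
    denom = 1.0 + math.exp(logit_a) + math.exp(logit_b)
    p_a = math.exp(logit_a) / denom
    p_b = math.exp(logit_b) / denom
    p_c = 1.0 / denom  # reference category
    return p_a, p_b, p_c

# The three probabilities always sum to 1:
p_a, p_b, p_c = glogit_probs(0.5, -1.2)
print(round(p_a + p_b + p_c, 10))  # 1.0
```

The same arithmetic extends to any number of non-reference categories: the denominator simply gains one exp(·) term per logit.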
4. Example
Let’s take a hypothetical example of a Magazine Subscription company
whose business model allows the marketing department to send direct
mail solicitations which have a ‘Bill Me Later’ option. There are three
possible outcomes in response to the mailers:
1) Prospects who enroll in the subscription AND subsequently make the first payment when it becomes due. We will call this OUTCOME = A.
2) Prospects who enroll in the subscription but fail to make the first payment. We will call this OUTCOME = B.
3) Non-responders: prospects who do not respond to the solicitation. We will call this OUTCOME = C.
The objective of the marketing activities is the maximization of Outcome A. The data science team has been tasked with building response models that can identify the top prospects most likely to result in Outcome A so that campaigns can be laser-focused on these prospects.
5. Example Data
Obs var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 Outcome
1 11 1 10 10 11 2 0.481 0.45849 3 0.70922 11 C
2 12 1 12 8 7 4 0.522 0.58634 4 0.62200 12 A
3 8 2 11 6 10 4 0.485 0.30000 3 0.70000 8 C
4 14 2 11 9 10 2 0.571 0.80000 3 0.50322 14 B
5 13 1 4 4 7 1 0.656 0.66096 2 0.63095 13 B
6 9 2 8 6 4 2 0.621 0.56050 4 0.68550 9 B
7 18 2 11 9 10 2 0.978 0.80000 3 0.35254 18 C
8 7 2 11 9 10 4 0.485 0.43801 3 0.63667 7 C
9 12 0 13 9 7 2 0.587 0.47677 4 0.74105 12 B
10 11 2 12 10 11 4 0.499 0.62370 4 0.58971 11 B
11 11 1 5 6 2 2 0.556 0.77539 1 0.56912 11 C
12 6 2 11 9 10 2 0.529 0.47070 3 0.64353 6 B
The table above is a selection of 12 prospect records from the hypothetical dataset. We
would like to predict the probability of class membership using the 11 independent
variables (Var1-Var11). Our target variable has the 3 outcome classes (A,B and C).
The following code was first run as a variable selection step. Other variable selection methods (backward or forward) can be used by changing the SELECTION option. The significance level for retaining variables in the model can be controlled via the SLS (SLSTAY) option, and the level for adding variables via SLE (SLENTRY).
PROC LOGISTIC DATA=sample1;
  MODEL category(ref='C') = Var1-Var11 /
    SELECTION=stepwise SLS=0.0001 LINK=glogit;
RUN;
6. SAS OUTPUT
The LOGISTIC Procedure
Model Information
Data Set WORK.SAMPLE1
Response Variable category
Number of Response Levels 3
Model generalized logit
Optimization Technique Fisher's scoring
Number of Observations Read 20973
Number of Observations Used 20379
Response Profile
Ordered Total
Value category Frequency
1 A 2390
2 B 8051
3 C 9938
Logits modeled use category='C' as the reference
category.
This is the first part of the output, showing the name of our dataset (SAMPLE1), the target variable (CATEGORY) and the type of modeling performed (GENERALIZED LOGIT). We can also see the distribution by category and the group used as the reference (category='C').
7. Summary of Stepwise Selection
                                Number   Score        Wald
Step  Entered  Removed   DF     In       Chi-Square   Chi-Square   Pr > ChiSq
  1   var6                2      1        959.2276                   <.0001
  2   var4                2      2        450.8481                   <.0001
  3   var1                2      3         48.6350                   <.0001
  4   var8                2      4         18.0758                   0.0001
  5   var9                2      5         10.8040                   0.0045
  6            var9       2      4                      10.7888      0.0045
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter category DF Estimate Error Chi-Square Pr > ChiSq
Intercept A 1 -3.0312 0.1428 450.4170 <.0001
Intercept B 1 -2.0812 0.0946 483.7261 <.0001
var1 A 1 -0.0263 0.00635 17.1955 <.0001
var1 B 1 -0.0147 0.00429 11.7130 0.0006
var4 A 1 0.1300 0.0118 122.0528 <.0001
var4 B 1 0.1466 0.00770 361.9967 <.0001
var6 A 1 0.3293 0.0224 215.4427 <.0001
var6 B 1 0.3904 0.0150 677.6613 <.0001
var8 A 1 -0.00841 0.1318 0.0041 0.9491
var8 B 1 -0.3561 0.0880 16.3859 <.0001
Odds Ratio Estimates
Point 95% Wald
Effect category Estimate Confidence Limits
var1 A 0.974 0.962 0.986
var1 B 0.985 0.977 0.994
var4 A 1.139 1.113 1.165
var4 B 1.158 1.141 1.175
var6 A 1.390 1.330 1.452
var6 B 1.478 1.435 1.522
var8 A 0.992 0.766 1.284
var8 B 0.700 0.589 0.832
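Each odds ratio in this table is simply exp(estimate) from the maximum likelihood table above. A short Python cross-check using a few of the reported coefficients:

```python
import math

# Selected coefficients from the Analysis of Maximum Likelihood Estimates table
estimates = {("var1", "A"): -0.0263, ("var4", "B"): 0.1466, ("var6", "B"): 0.3904}

# Odds ratio = exp(beta): the multiplicative change in the odds of the category
# (versus reference C) for a one-unit increase in the variable
odds_ratios = {key: round(math.exp(beta), 3) for key, beta in estimates.items()}
print(odds_ratios)  # matches the Point Estimate column: 0.974, 1.158, 1.478
```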
The stepwise method, at the 0.01% level of significance (SLS=0.0001), selected 4 variables for the model (Var6, Var4, Var1 and Var8). Var9 entered at step 5 but was removed again at step 6.
NB: Var8 is highly significant in predicting Logit_B but not Logit_A. We could still retain this variable, or preferably evaluate predictions on a validation sample with and without it to determine whether it significantly affects the bottom line and model stability.
Based on these, we can now
run the procedure using the
selected variables as follows:
PROC LOGISTIC DATA=sample1;
  MODEL category(ref='C') = var1 var4 var6 var8 / LINK=glogit;
RUN;
8. Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 40611.437 39048.985
SC 40627.339 39128.495
-2 Log L 40607.437 39028.985
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 1578.4519 8 <.0001
Score 1534.3289 8 <.0001
Wald 1447.7008 8 <.0001
Type 3 Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
var1 2 24.2915 <.0001
var4 2 428.3454 <.0001
var6 2 778.7953 <.0001
var8 2 17.1800 0.0002
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter category DF Estimate Error Chi-Square Pr > ChiSq
Intercept A 1 -3.0998 0.1409 483.8847 <.0001
Intercept B 1 -2.1294 0.0931 522.9083 <.0001
var1 A 1 -0.0267 0.00617 18.7590 <.0001
var1 B 1 -0.0152 0.00416 13.4127 0.0002
var4 A 1 0.1328 0.0117 129.8235 <.0001
var4 B 1 0.1487 0.00761 381.7793 <.0001
var6 A 1 0.3374 0.0222 231.3911 <.0001
var6 B 1 0.3975 0.0148 721.3059 <.0001
var8 A 1 0.0339 0.1283 0.0699 0.7915
var8 B 1 -0.3279 0.0854 14.7368 0.0001
The model is statistically sound, with all four variables being highly significant.
With 3 classes in our target variable, the model produces two logit equations:
1) Logit_A = -3.0998 - 0.0267*var1 + 0.1328*var4 + 0.3374*var6 + 0.0339*var8
2) Logit_B = -2.1294 - 0.0152*var1 + 0.1487*var4 + 0.3975*var6 - 0.3279*var8
An additional step is required to convert these logits into probabilities…
9. Converting Logits into Probabilities

DATA probs;
  SET sample1;
  Logit_A = -3.0998 - 0.0267*var1 + 0.1328*var4 + 0.3374*var6 + 0.0339*var8;
  Logit_B = -2.1294 - 0.0152*var1 + 0.1487*var4 + 0.3975*var6 - 0.3279*var8;
  Pr_C = 1/(1 + exp(Logit_A) + exp(Logit_B));
  Pr_A = Pr_C * exp(Logit_A);
  Pr_B = Pr_C * exp(Logit_B);
RUN;
Obs category Pr_C Pr_A Pr_B
1 C 0.35865 0.14557 0.49578
2 B 0.31443 0.1523 0.53327
3 C 0.63112 0.09269 0.27619
4 B 0.40749 0.12573 0.46678
5 A 0.30994 0.1454 0.54466
6 A 0.32001 0.13846 0.54153
7 B 0.31631 0.15339 0.5303
8 B 0.56396 0.10476 0.33128
9 A 0.3457 0.13923 0.51507
10 C 0.51582 0.10483 0.37936
11 B 0.40708 0.12802 0.4649
12 C 0.76608 0.05566 0.17826
SAMPLE OUTPUT:
We can now look at the probability of group membership for each record, and we can also derive other probabilities of interest, as in the following example where we decide to prioritize response propensity.
IF:
Pr_A = probability of responding to a direct mail campaign AND making a payment
Pr_B = probability of responding to a direct mail campaign AND NOT making a payment
Pr_C = probability of NOT responding to a direct mail campaign
THEN:
Pr_D = probability of responding to a direct mail campaign = Pr_A + Pr_B
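Because the three class probabilities sum to 1, Pr_D can equivalently be computed as 1 - Pr_C. A quick Python check using record 1 from the sample output above:

```python
# Record 1 from the sample output: Pr_C, Pr_A, Pr_B
pr_c, pr_a, pr_b = 0.35865, 0.14557, 0.49578

pr_d = pr_a + pr_b          # probability of responding at all
print(round(pr_d, 5))       # 0.64135
print(round(1 - pr_c, 5))   # 0.64135, the same value, since Pr_A+Pr_B+Pr_C = 1
```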
10. Ranked Output

Obs      Pr_C      Pr_A      Pr_B      Pr_D
1 0.24957 0.1559 0.59453 0.75043
2 0.24957 0.1559 0.59453 0.75043
3 0.25201 0.1577 0.59029 0.74799
4 0.25219 0.16249 0.58532 0.74781
5 0.25465 0.16437 0.58099 0.74535
6 0.25581 0.16052 0.58367 0.74419
…….. …….. …….. …….. ……..
20968 0.83541 0.06025 0.10434 0.16459
20969 0.83541 0.04025 0.12434 0.16459
20970 0.85541 0.04025 0.10434 0.14459
20971 0.85767 0.0393 0.10304 0.14233
20972 0.85767 0.0393 0.10304 0.14233
20973 0.85989 0.03836 0.10175 0.14011
The table above was produced by sorting the data by descending Pr_D and then by descending Pr_A. The top records have a high likelihood of responding to the campaign AND a high likelihood of making a payment; we would therefore target these for the next mailing. The bottom records have a high likelihood of NOT responding (high Pr_C).
11. We can also incorporate profitability measures to our evaluation tables as
follows:
Assume the value of each of the actions is as follows:
– Value of non-payer responder (B) = -$0.10
– Value of non-responder (C) = -$0.07
– Value of a payer (A) = $12.00
The expected value of each prospect can be calculated as:
• Expected Value of Prospect = 12*Pr_A - 0.10*Pr_B - 0.07*Pr_C
All the records are then sorted by Expected Value and the mailing is performed for prospects with high expected values. This is an enhancement to the traditional response model, which does not incorporate profitability. This method pushes unprofitable prospects who otherwise have high response propensities down to the lower ranks.
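A minimal Python sketch of this expected-value ranking, using the action values assumed above; the two illustrative probability rows are taken from the ranked table on the previous slide, and the variable names are ours:

```python
# Assumed per-outcome values from the text (illustrative)
VALUE_A, VALUE_B, VALUE_C = 12.00, -0.10, -0.07

def expected_value(pr_a, pr_b, pr_c):
    """Expected dollar value of mailing one prospect."""
    return VALUE_A * pr_a + VALUE_B * pr_b + VALUE_C * pr_c

# Two illustrative prospects: (id, Pr_A, Pr_B, Pr_C)
prospects = [
    (1, 0.1559, 0.59453, 0.24957),   # likely responder-payer
    (2, 0.04025, 0.12434, 0.83541),  # likely non-responder
]

# Rank prospects by expected value, highest first, and mail the top ranks
ranked = sorted(prospects, key=lambda r: expected_value(*r[1:]), reverse=True)
for pid, *probs in ranked:
    print(pid, round(expected_value(*probs), 4))
```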
12. Conclusion
• There are numerous situations in the analytics world in which we
could be required to address polytomous dependent variables. For
instance, deciding which product to more prominently feature on a
website for each visitor given a set of possible products on a
promotion.
• Generalized Logit Regression provides an easy-to-implement solution for calculating comparable scores in such situations. The approach is very versatile but currently quite underutilized in data science. It makes a great addition to the data scientist’s toolkit.