Binary logistic regression classification uses one or more predictor variables, which may be continuous or categorical, to predict the class of a target variable. The technique identifies the important factors affecting the target variable and the nature of the relationship between each factor and the dependent variable. It is useful for analyzing multiple factors that influence an outcome, or for any classification problem where there are two possible outcomes.
What is Binary Logistic Regression Classification and How is it Used in Analysis?
1. Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
3. Basic Terminologies
Target variable, usually denoted by Y, is the variable being predicted. It is
also called the dependent variable, output variable, response variable or
outcome variable (Ex: the Default column in the table below)
Predictor, sometimes called an independent variable, is a variable that is
used to predict the target variable (Ex: the Age, Marital Status and Loan
Status columns in the table below)

Age  Marital Status  Loan Status  Default
58   married         no           yes
44   single          no           no
33   married         yes          yes
47   married         no           yes
33   single          no           no
35   married         no           yes
28   single          yes          no
4. Introduction
• Objective:
• Logistic regression measures the relationship between a categorical target
variable and one or more independent variables
• It deals with situations in which the target variable can take only two
possible values
• Thus, logistic regression uses one or more predictor variables, which may be
either continuous or categorical, to predict the target variable class
• Benefit:
• The logistic regression model output helps identify the important factors (Xi)
impacting the target variable (Y) and the nature of the relationship between
each of these factors and the dependent variable
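The deck does not show the formula behind the model, so the following is only a minimal sketch of the standard logistic (sigmoid) function, which turns a weighted sum of predictors into a probability between 0 and 1. The function names, coefficients and inputs are hypothetical and chosen only to illustrate the mechanics.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, features):
    # Linear score: b0 + b1*x1 + ... + bk*xk
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical coefficients and inputs (not taken from the deck).
p = predict_probability(intercept=-1.0, coefficients=[0.8, 0.5], features=[1.0, 1.0])
print(round(p, 3))  # prints 0.574
```

A score of zero maps to a probability of exactly 0.5, which is why 0.5 is a common cut-off between the two classes.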
5. Example : Binary Logistic Regression : Input
Let’s conduct the binary logistic regression analysis on the following variables:

Default Status  Age  Marital Status  Existing Loan Status  Income
Defaulted       58   married         no                    46,399
Not Defaulted   44   single          no                    47,971
Defaulted       33   married         yes                   52,618
Defaulted       47   married         no                    28,717
Not Defaulted   33   single          no                    41,216
Defaulted       35   married         no                    34,372
Not Defaulted   28   single          yes                   64,811
Not Defaulted   42   divorced        no                    53,000
Defaulted       58   married         no                    41,375
Not Defaulted   43   single          no                    53,778
Not Defaulted   41   divorced        no                    44,440
Not Defaulted   29   single          no                    51,026

Target variable (Y): Default Status
Independent variables (Xi): Age, Marital Status, Existing Loan Status, Income
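Categorical predictors have to be turned into numbers before a model can use them. The sketch below shows one possible encoding of the first three input rows, using a 0/1 indicator for "married" and for an existing loan, in line with the "Marital Status (Married)" and "Existing loan (Yes)" terms in the model output. Pooling single and divorced into the baseline, and the dictionary keys used here, are assumptions of this sketch rather than something the deck specifies.

```python
# First three rows of the input table, written as dictionaries.
rows = [
    {"default": "Defaulted", "age": 58, "marital": "married", "loan": "no", "income": 46399},
    {"default": "Not Defaulted", "age": 44, "marital": "single", "loan": "no", "income": 47971},
    {"default": "Defaulted", "age": 33, "marital": "married", "loan": "yes", "income": 52618},
]


def encode(row):
    """Turn one record into (numeric feature list, 0/1 target)."""
    features = [
        row["age"],
        1 if row["marital"] == "married" else 0,  # Marital Status (Married)
        1 if row["loan"] == "yes" else 0,         # Existing loan (Yes)
        row["income"],
    ]
    target = 1 if row["default"] == "Defaulted" else 0
    return features, target


encoded = [encode(r) for r in rows]
for features, target in encoded:
    print(features, target)
```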
6. Example : Binary Logistic Regression : Output
COEFFICIENTS
Variable                  Coefficient  P value
(Intercept)               -2.34        0.00
Age                       0.01         0.07
Marital Status (Married)  0.5          0.04
Income                    0.1          0.04
Existing loan (Yes)       0.3          0.03
• The p values for marital status, income and existing loan are < 0.05;
hence these variables are important factors for predicting the default /
non-default class
• The p value for Age is > 0.05, which means Age does not affect the prediction
significantly
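One common way to read logistic regression coefficients, not shown in the deck, is to exponentiate them into odds ratios: exp(coefficient) is the factor by which the odds of the target class change for a one-unit increase in the predictor. The sketch below applies this to the coefficients in the output table and flags each predictor against the 0.05 cut-off. The variable names are illustrative.

```python
import math

# (coefficient, p value) pairs as listed in the output table of this example.
model = {
    "Age": (0.01, 0.07),
    "Marital Status (Married)": (0.5, 0.04),
    "Income": (0.1, 0.04),
    "Existing loan (Yes)": (0.3, 0.03),
}

ALPHA = 0.05  # significance cut-off used in the deck

for name, (coef, p_value) in model.items():
    odds_ratio = math.exp(coef)
    flag = "significant" if p_value < ALPHA else "not significant"
    print(f"{name}: odds ratio {odds_ratio:.2f} ({flag})")

# Predictors that pass the p value < 0.05 rule.
significant = [name for name, (_, p_value) in model.items() if p_value < ALPHA]
```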
7. Example : Binary Logistic Regression : Output
ACTUAL VERSUS PREDICTED (rows: actual class, columns: predicted class)
                       Predicted: Defaulted  Predicted: Not defaulted
Actual: Defaulted      35                    4
Actual: Not defaulted  4                     70
CLASSIFICATION ACCURACY: (35 + 70) / (35 + 70 + 4 + 4) = 105 / 113 ≈ 92.9%
• Prediction accuracy is a useful criterion for assessing model performance
• A model with prediction accuracy >= 70% is considered useful
CLASSIFICATION ERROR = 100% - Accuracy ≈ 7.1%
About 7.1% of the records are misclassified
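The accuracy arithmetic can be reproduced directly from the four cells of the actual-versus-predicted matrix. This is a small sketch using the counts from the slide; the dictionary layout is just one convenient representation.

```python
# Confusion matrix from the slide, keyed by (actual class, predicted class).
confusion = {
    ("Defaulted", "Defaulted"): 35,
    ("Defaulted", "Not defaulted"): 4,
    ("Not defaulted", "Defaulted"): 4,
    ("Not defaulted", "Not defaulted"): 70,
}

total = sum(confusion.values())
# Correct predictions sit on the diagonal, where actual == predicted.
correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)

accuracy = correct / total
error = 1 - accuracy
print(f"accuracy = {accuracy:.1%}, error = {error:.1%}")
# prints: accuracy = 92.9%, error = 7.1%
```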
9. SAMPLE OUTPUT 1 : MODEL SUMMARY
COEFFICIENT MATRIX:
Variable                  Coefficient  P value
(Intercept)               -2.34        0.00
Age                       0.01         0.07
Marital Status (Married)  0.5          0.04
Income                    0.1          0.04
Existing loan (Yes)       0.3          0.03

ACTUAL VERSUS PREDICTED (rows: actual class, columns: predicted class)
                       Predicted: Defaulted  Predicted: Not defaulted
Actual: Defaulted      35                    4
Actual: Not defaulted  4                     70
10. SAMPLE OUTPUT 2 : PREDICTED CLASS & PROBABILITY
Age  Marital Status  Existing Loan Status  Income  Default Status  Predicted class  Probability
58   married         no                    46,399  Defaulted       Defaulted        0.7
44   single          no                    47,971  Not Defaulted   Not Defaulted    0.9
33   married         yes                   52,618  Defaulted       Defaulted        0.8
47   married         no                    28,717  Defaulted       Defaulted        0.7
33   single          no                    41,216  Not Defaulted   Not Defaulted    0.6
35   married         no                    34,372  Defaulted       Not Defaulted    0.5
28   single          yes                   64,811  Not Defaulted   Defaulted        0.4
42   divorced        no                    53,000  Not Defaulted   Not Defaulted    0.3
58   married         no                    41,375  Defaulted       Defaulted        0.2
43   single          no                    53,778  Not Defaulted   Defaulted        0.1
Thus, the output will contain a predicted class column, a confusion matrix and a classification plot
11. SAMPLE OUTPUT 3 : CLASSIFICATION PLOT
• The less the two classes overlap in the classification plot, the better the
classification done by the model
12. INTERPRETATION OF IMPORTANT MODEL SUMMARY STATISTICS
Accuracy:
If accuracy >= 70%: the model fits the provided data well and the predicted
classes are reasonably accurate
If accuracy < 70%: the model does not fit the provided data well and the
predicted classes are likely to contain a high rate of error
Coefficients and p value:
If the coefficient is positive and the p value is < 0.05, the variable is
positively associated with the target variable
If the coefficient is negative and the p value is < 0.05, the variable is
negatively associated with the target variable
If the p value is > 0.05, the variable is unimportant for predicting the target
variable classes
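The interpretation rules above can be written as two small helper functions. This is only a sketch: the function names and return labels are hypothetical, and the handling of a p value of exactly 0.05 (which the slide does not cover) is an assumption.

```python
def interpret_coefficient(coef, p_value, alpha=0.05):
    """Apply the coefficient / p value rules to one predictor."""
    if p_value >= alpha:
        # The slide covers only p < 0.05 and p > 0.05; treating p == 0.05
        # as "not important" is an assumption of this sketch.
        return "not important"
    if coef > 0:
        return "positive"
    if coef < 0:
        return "negative"
    return "no direction"  # coefficient of exactly zero, not covered by the slide


def interpret_accuracy(accuracy_pct, threshold=70.0):
    """Apply the 70% accuracy rule from the slide."""
    return "well fit" if accuracy_pct >= threshold else "not well fit"


print(interpret_coefficient(0.5, 0.04))   # Marital Status (Married) in the example
print(interpret_coefficient(0.01, 0.07))  # Age in the example
print(interpret_accuracy(92.9))
```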
13. Limitations
It is applicable only when the target variable is categorical
The sample size must be at least 1000 in order to get reliable predictions
Binary logistic regression is not suitable when the number of classes is > 2
Level 1 of the target variable should represent the desired outcome,
i.e. if the desired class is Yes in a response/non-response target variable,
then Yes has to be recoded as 1 and No as 0
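The recoding step described in the last limitation could look like the sketch below. The helper name `recode_target` and the sample labels are illustrative. The check for exactly two classes reflects the point that binary logistic regression does not apply to more than two classes.

```python
def recode_target(values, desired="Yes"):
    """Return 1 for the desired class and 0 for the other class."""
    classes = set(values)
    if len(classes) != 2:
        raise ValueError(f"expected exactly 2 classes, found {len(classes)}")
    if desired not in classes:
        raise ValueError(f"desired class {desired!r} not present in the data")
    return [1 if value == desired else 0 for value in values]


# Illustrative labels only.
labels = ["Yes", "No", "No", "Yes"]
print(recode_target(labels))  # prints [1, 0, 0, 1]
```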
14. General applications
Credit/loan approval analysis
• Given a list of a client’s transactional attributes, predict whether the client will default on a bank loan
Medical diagnosis
• Given a list of symptoms, predict whether a patient has disease X
Rain forecasting
• Based on temperature, humidity, pressure etc., predict whether it will rain
Treatment effectiveness analysis
• Based on a patient’s body attributes such as blood pressure, sugar, hemoglobin, name of the drug taken, type of treatment taken etc., check the likelihood of the disease being cured
Fraud analysis
• Based on the various bills submitted by an employee for reimbursement of food, travel, medical expenses etc., predict the likelihood of the employee committing fraud
15. Use case 1
Business problem:
• A bank loans officer wants to predict whether a loan applicant will be a defaulter or a non-defaulter based on attributes such as loan amount, monthly installment, employment tenure, times delinquent, annual income, debt to income ratio etc.
• Here the target variable would be ‘past default status’ and the predicted class would contain the values ‘yes’ or ‘no’, representing the ‘likely to default’ and ‘unlikely to default’ classes respectively.
Business benefit:
• Once classes are assigned, the bank will have a loan applicants’ dataset with each applicant labeled as “likely/unlikely to default”.
• Based on these labels, the bank can easily decide whether to give a loan to an applicant and, if yes, what credit limit and interest rate the applicant is eligible for based on the amount of risk involved.
16. Use case 1 : Input Dataset
Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Past default status
1039153      21000        701.73               105000         9                     5                 4                  No
1069697      15000        483.38               92000          11                    5                 2                  No
1068120      25600        824.96               110000         10                    9                 2                  No
563175       23000        534.94               80000          9                     2                 12                 No
562842       19750        483.65               57228          11                    3                 21                 Yes
562681       25000        571.78               113000         10                    0                 9                  No
562404       21250        471.2                31008          12                    1                 12                 Yes
700159       14400        448.99               82000          20                    6                 6                  No
696484       10000        241.33               45000          18                    8                 2                  Yes
702598       11700        381.61               45192          20                    7                 3                  Yes
702470       10000        243.29               38000          17                    9                 7                  Yes
702373       4800         144.77               54000          19                    8                 2                  Yes
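The deck does not say how the model is fitted, so the sketch below swaps in plain gradient descent on the log loss, written in pure Python, to show the general idea on three columns of the twelve sample rows above. The choice of predictors, the standardization step, the learning rate and the iteration count are all assumptions of this sketch. Twelve rows are far too few for a reliable model, so the fitted weights are for illustration only.

```python
import math
from statistics import mean, pstdev

# (debt to income ratio, times delinquent, employment tenure, past default: 1 = Yes, 0 = No)
rows = [
    (9, 5, 4, 0), (11, 5, 2, 0), (10, 9, 2, 0), (9, 2, 12, 0),
    (11, 3, 21, 1), (10, 0, 9, 0), (12, 1, 12, 1), (20, 6, 6, 0),
    (18, 8, 2, 1), (20, 7, 3, 1), (17, 9, 7, 1), (19, 8, 2, 1),
]
y = [r[3] for r in rows]

# Standardize each predictor so that one learning rate suits all of them.
columns = list(zip(*[r[:3] for r in rows]))
means = [mean(c) for c in columns]
stdevs = [pstdev(c) for c in columns]
X = [[(v - m) / s for v, m, s in zip(r[:3], means, stdevs)] for r in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted probability of default for one standardized row."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def log_loss(weights, bias):
    # Average negative log-likelihood over all rows.
    return -mean(
        yi * math.log(predict(weights, bias, xi))
        + (1 - yi) * math.log(1 - predict(weights, bias, xi))
        for xi, yi in zip(X, y)
    )


weights, bias = [0.0, 0.0, 0.0], 0.0
initial_loss = log_loss(weights, bias)

LEARNING_RATE = 0.1  # assumption: hand-picked for this toy data
for _ in range(500):
    errors = [predict(weights, bias, xi) - yi for xi, yi in zip(X, y)]
    grad_w = [mean(e * xi[j] for e, xi in zip(errors, X)) for j in range(3)]
    grad_b = mean(errors)
    weights = [w - LEARNING_RATE * g for w, g in zip(weights, grad_w)]
    bias -= LEARNING_RATE * grad_b

final_loss = log_loss(weights, bias)
print(f"log loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

In practice a statistics package would be used for fitting, since it also reports the p values that the interpretation rules in this deck rely on.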
17. Use case 1 : Output : Predicted Class
Output: Each record will have the predicted class assigned as shown below (column: Likelihood to default):
Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Past default status  Likelihood to default
1039153      21000        701.73               105000         9                     5                 4                  No                   No
1069697      15000        483.38               92000          11                    5                 2                  No                   No
1068120      25600        824.96               110000         10                    9                 2                  No                   No
563175       23000        534.94               80000          9                     2                 12                 No                   No
562842       19750        483.65               57228          11                    3                 21                 Yes                  No
562681       25000        571.78               113000         10                    0                 9                  No                   No
562404       21250        471.2                31008          12                    1                 12                 Yes                  Yes
700159       14400        448.99               82000          20                    6                 6                  No                   No
696484       10000        241.33               45000          18                    8                 2                  Yes                  Yes
702598       11700        381.61               45192          20                    7                 3                  Yes                  Yes
702470       10000        243.29               38000          17                    9                 7                  Yes                  Yes
702373       4800         144.77               54000          19                    8                 2                  Yes                  No
18. Use case 1 : Output : Class profile
Class (Likely to default)  Average loan amount  Average monthly installment  Average annual income  Average debt to income ratio  Average times delinquent  Average employment tenure
No                         10447.30             304.87                       66467.74               9.58                          1.69                      16.82
Yes                        7521.32              227.43                       60935.28               16.55                         6.91                      4.01
As can be seen in the table above, defaulters (Class: Yes) and non-defaulters (Class: No) have
distinctive characteristics.
Defaulters tend to be delinquent more often, and to have a higher debt to income ratio and a lower
employment tenure than non-defaulters.
Hence, delinquency, employment tenure and debt to income ratio are the determining factors when
it comes to classifying loan applicants into likely defaulters / non-defaulters.
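A class profile of this kind is a group-by average over the predicted class. The sketch below computes it for three columns of the twelve sample rows shown in the predicted-class table. Because the deck's profile appears to be computed on a larger dataset that is not shown, these sample averages will not match the figures in the profile table. The field names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (debt to income ratio, times delinquent, employment tenure, predicted class)
records = [
    (9, 5, 4, "No"), (11, 5, 2, "No"), (10, 9, 2, "No"), (9, 2, 12, "No"),
    (11, 3, 21, "No"), (10, 0, 9, "No"), (12, 1, 12, "Yes"), (20, 6, 6, "No"),
    (18, 8, 2, "Yes"), (20, 7, 3, "Yes"), (17, 9, 7, "Yes"), (19, 8, 2, "No"),
]
FIELDS = ("debt_to_income", "times_delinquent", "employment_tenure")

# Group the numeric values by predicted class.
groups = defaultdict(list)
for *values, predicted in records:
    groups[predicted].append(values)

# Average each column within each class.
profile = {
    cls: {field: mean(column) for field, column in zip(FIELDS, zip(*rows))}
    for cls, rows in groups.items()
}

for cls, averages in profile.items():
    print(cls, averages)
```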
19. Use case 2
Business problem:
• A doctor/pharmacist wants to predict the likelihood of a new patient’s disease being cured or not cured based on various attributes of the patient such as blood pressure, hemoglobin level, sugar level, name of the drug given to the patient, name of the treatment given to the patient etc.
• Here the target variable would be ‘past cure status’ and the predicted class would contain the values ‘yes’ or ‘no’, meaning ‘prone to cure’ and ‘not prone to cure’ respectively.
Business benefit:
• Given the body profile of a patient and the recent treatments and drugs taken by him/her, the probability of a cure can be predicted and changes in treatment/drug can be suggested if required.
20. Use case 3
Business problem:
• An accountant/human resource manager wants to predict the likelihood of an employee defrauding the company based on the various bills submitted by him/her so far, such as food bills, travel bills and medical bills.
• The target variable in this case would be ‘past fraud status’ and the predicted class would contain the values ‘yes’ or ‘no’, representing likely fraud and no fraud respectively.
Business benefit:
• Such classification can prevent a company from spending unreasonably on any employee and can in turn save the company budget by detecting such fraud beforehand.
21. Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018