Business Analytics – The Science of Data Driven Decision Making
Logistic Regression
U Dinesh Kumar
CLASSIFICATION PROBLEMS
• Classification problems are an important category of
problems in analytics in which the response variable (Y)
takes a discrete value.
• The primary objective is to predict the class of a
customer (or class probability) based on the values of
explanatory variables or predictors.
Classification Problems
Classification is an important category of problems in which
the decision maker would like to classify the
case/entity/customer into two or more groups.
Examples of Classification Problems:
• Customer profiling (customer segmentation)
• Customer churn
• Credit classification (low, medium, and high risk)
• Employee attrition
• Fraud detection (classifying a transaction as fraud/no-fraud)
• Stress levels
• Text classification (sentiment analysis)
• Outcome of any binomial or multinomial experiment
Challenging Classification Problems
• Ransomware
• Anomaly detection
• Image classification (medical devices, satellite images)
• Text classification
Logistic Regression - Supervised Learning Algorithm
Logistic Function (Sigmoidal Function)
Logistic regression is a method used to fit a regression model when
the response variable is binary.
Logistic regression uses a method known as maximum likelihood
estimation to find an equation of the following form (the logistic
function in log-odds form):

log[p(X) / (1 - p(X))] = β0 + β1X1 + β2X2 + … + βpXp

where:
Xj: the jth predictor variable (IV)
βj: the coefficient estimate for the jth predictor variable
p(X): the probability that the response variable equals 1, i.e. P(Y = 1 | X)
The right-hand side of the equation predicts the log odds of the
response variable (DV) taking on a value of 1.
Thus, when we fit a logistic regression model, we can use
the following equation to calculate the probability that a
given observation takes on a value of 1:

p(X) = e^(β0 + β1X1 + … + βpXp) / (1 + e^(β0 + β1X1 + … + βpXp))

We then use some probability threshold to classify the
observation as either 1 or 0. Observations with a probability
greater than or equal to 0.5 will be classified as “1” and all
other observations will be classified as “0.”
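The two steps above (probability from log odds, then thresholding) can be sketched in a few lines of Python. The coefficients and feature values below are hypothetical, chosen only for illustration:

```python
import math

def predict_probability(coeffs, intercept, x):
    """Convert the linear predictor (log odds) into a probability
    using the logistic (sigmoid) function."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coeffs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(p, threshold=0.5):
    """Apply the probability threshold: >= threshold -> 1, else 0."""
    return 1 if p >= threshold else 0

# Hypothetical coefficients and predictor values (not fitted to real data)
p = predict_probability([0.005, -0.00001], -10.0, [1400, 2000])
print(round(p, 4))   # a small probability, so classify() returns 0
print(classify(p))
```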
Note: Logistic regression borrows the predictive-modeling machinery of
regression, which is why it is called logistic regression; however, because
it is used to classify samples, it falls under the classification algorithms.
Logistic Function (Sigmoid Function)
• The sigmoid function is a mathematical function used to map predicted
values to probabilities.
• The output of logistic regression must lie between 0 and 1 and cannot
go beyond these limits, so it forms an "S"-shaped curve.
• This S-shaped curve is called the sigmoid function or the logistic
function.
• Threshold value: the probability cutoff used to decide between 0 and 1.
• Values above the threshold are classified as 1, and values below the
threshold are classified as 0.
Assumptions for logistic regression:
• The dependent variable must be categorical in nature.
• The independent variables should not exhibit multicollinearity.
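The S-shape and the 0-to-1 bounds described above are easy to verify numerically. A minimal sketch of the sigmoid function:

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# The characteristic S-shape: large negative inputs approach 0,
# zero maps to exactly 0.5, large positive inputs approach 1.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```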
Types of Logistic Regression
• Binomial: the dependent variable has only two possible
categories, such as 0 or 1, or Pass or Fail.
• Multinomial: the dependent variable has 3 or more possible
unordered categories, such as "cat", "dog", or "sheep".
• Ordinal: the dependent variable has 3 or more possible
ordered categories, such as "low", "medium", or "high".
Logistic Regression Model Development
1. Explore the data; derive and analyze descriptive statistics.
2. Pre-process the data – divide the data into training and validation data.
3. Define the functional form of the relationship.
4. Estimate the regression parameters.
5. Perform diagnostic tests.
6. If the model satisfies the diagnostic tests, stop; otherwise, revise the model and repeat the earlier steps.
Example 1
The “Default” dataset contains the following information about 10,000 individuals:
• default: indicates whether or not an individual defaulted.
• student: indicates whether or not an individual is a student.
• balance: average balance carried by an individual.
• income: income of the individual.
We will use student status, bank balance, and income to build a
logistic regression model that predicts the probability that a
given individual defaults.
Descriptive / Exploratory Statistics
Steps in Logistic Regression:
1. Data pre-processing steps:
• Pre-process/prepare the data so that it can be used efficiently in the code.
• Extract the dependent and independent variables from the given
dataset (IVs = _______, DV = ________).
• Split the dataset into a training set and a test set (75%, 25%).
• Feature scaling.
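The feature-scaling step can be illustrated with a library-free sketch of z-score standardization (one of the scaling techniques described in the notes; min-max normalization is the other common choice). The sample values are hypothetical:

```python
def standardize(values):
    """Z-score standardization: rescale a feature so it has
    mean 0 and variance 1."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical sample of the 'balance' feature
balance = [729.5, 817.2, 1073.5, 529.3, 785.7]
scaled = standardize(balance)
print([round(s, 3) for s in scaled])
```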
The scaled output is given below:
Create training and test samples: split the dataset into a training set to train the
model on and a test set to evaluate the model on.
2. Fitting logistic regression to the training set
a. The LogisticRegression class is loaded from the available library, a classifier
object is created, and the training dataset is used to fit the logistic
regression model.
b. Model fit diagnostics are checked to see whether the model is well
fitted to the training set.
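Under the hood, a library fitter finds the coefficients by maximum likelihood estimation. As a library-free sketch of that idea (not the library's actual implementation), the snippet below maximizes the log-likelihood of a one-predictor model by gradient ascent on toy, hypothetical data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit beta0 + beta1*x by gradient ascent on the log-likelihood:
    the gradient per observation is (y - p) and (y - p)*x."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)) / n
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Toy, hypothetical training data: larger x makes y = 1 more likely
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   0,   1,   1,   1,   1]
b0, b1 = fit_logistic(xs, ys)
print(b1 > 0)  # the fitted slope should be positive
```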
The coefficients in the output indicate the average change in the log odds of defaulting. For
example, a one-unit increase in balance is associated with an average increase of 0.005988 in
the log odds of defaulting.
The p-values in the output also give us an idea of how effective each predictor variable is at
predicting the probability of default:
p-value of student status: 0.0843; p-value of balance: <0.0000; p-value of income: 0.4304
We can see that balance and student status seem to be important predictors, since they have low
p-values, while income is not nearly as important.
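Because the coefficient is a change in log odds, exponentiating it gives the multiplicative effect on the odds themselves. Using the balance coefficient 0.005988 from the output above:

```python
import math

beta_balance = 0.005988  # balance coefficient reported on the slide

# A one-unit increase in balance multiplies the odds of default by e^beta
odds_multiplier_1 = math.exp(beta_balance)
# Over a 100-unit increase the effect compounds
odds_multiplier_100 = math.exp(beta_balance * 100)

print(round(odds_multiplier_1, 4))    # about 1.006 (a 0.6% increase in odds)
print(round(odds_multiplier_100, 2))  # about 1.82
```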
Assessing Model Fit:
There is no R² value for logistic regression. Instead, we compute
a metric known as McFadden’s R², which ranges from 0 to just
under 1. Values close to 0 indicate that the model has no
predictive power; in practice, values over 0.40 indicate that a
model fits the data very well.
VIF Values:
We can also calculate the VIF values of each variable in the
model to see whether multicollinearity is a problem. VIF values
above 5 indicate severe multicollinearity.
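McFadden's R² is computed from two log-likelihoods: the fitted model's and the intercept-only (null) model's. A small sketch with hypothetical log-likelihood values:

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R^2 = 1 - (log-likelihood of the fitted model /
    log-likelihood of the intercept-only model). Both log-likelihoods
    are negative; a better fit makes ll_model less negative."""
    return 1.0 - ll_model / ll_null

# Hypothetical log-likelihood values for illustration
print(round(mcfadden_r2(-400.0, -1500.0), 3))  # 0.733
```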
Step 4: Use the Model to Make Predictions
Use the model to make predictions about whether or not an individual will default based
on their student status, balance, and income.
We can use the following code to calculate the probability of default for every
individual in our test dataset:
Step 5: Model Diagnostics
Lastly, we can analyze how well our model performs on the test dataset.
Any individual in the test dataset with a probability of default greater
than 0.5 will be predicted (classified) as a defaulter. However, we can
also find the optimal probability cutoff that maximizes the accuracy of
our model.
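A simple stand-in for the optimal-cutoff routine the slides allude to is a grid search over candidate thresholds, keeping the one with the highest accuracy. The probabilities and labels below are hypothetical:

```python
def best_accuracy_cutoff(probs, actuals, grid=None):
    """Scan candidate thresholds and return the (threshold, accuracy)
    pair that maximizes classification accuracy."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    best_t, best_acc = 0.5, -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        acc = sum(p == a for p, a in zip(preds, actuals)) / len(actuals)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical predicted probabilities and true default labels
probs   = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.75]
actuals = [0,   0,   0,    1,   1,    1,   0,   1]
t, acc = best_accuracy_cutoff(probs, actuals)
print(t, acc)
```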
Using this threshold, we can create a confusion matrix which shows
our predictions compared to the actual defaults. From it we can compute
the sensitivity (also known as the “true positive rate”) and the
specificity (also known as the “true negative rate”).
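The confusion-matrix counts and both rates can be computed directly from predicted and actual labels. A minimal sketch on hypothetical data:

```python
def confusion_stats(preds, actuals):
    """Build the 2x2 confusion counts and derive sensitivity
    (TP / (TP + FN)) and specificity (TN / (TN + FP))."""
    tp = sum(p == 1 and a == 1 for p, a in zip(preds, actuals))
    tn = sum(p == 0 and a == 0 for p, a in zip(preds, actuals))
    fp = sum(p == 1 and a == 0 for p, a in zip(preds, actuals))
    fn = sum(p == 0 and a == 1 for p, a in zip(preds, actuals))
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

# Hypothetical predictions vs. actual defaults
stats = confusion_stats([1, 0, 1, 1, 0, 0, 1, 0],
                        [1, 0, 0, 1, 0, 1, 1, 0])
print(stats)
```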
#plot the ROC curve
plotROC(test$default, predicted)
The ROC (Receiver Operating Characteristic) curve displays the
percentage of true positives predicted by the model as the
prediction probability cutoff is lowered from 1 to 0.
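The sweep that `plotROC` performs, and the AUC summary mentioned in the notes, can be sketched without any library. The scores below are hypothetical and deliberately rank all positives above all negatives, so the AUC is 1.0:

```python
def roc_points(probs, actuals):
    """Sweep the cutoff from 1 down to 0 and record (FPR, TPR) pairs,
    which is exactly what an ROC curve plots."""
    pos = sum(actuals)
    neg = len(actuals) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(probs), reverse=True):
        tp = sum(p >= t and a == 1 for p, a in zip(probs, actuals))
        fp = sum(p >= t and a == 0 for p, a in zip(probs, actuals))
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores: a model that ranks every positive above every negative
probs   = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
actuals = [1,   1,   1,   0,   0,   0]
pts = roc_points(probs, actuals)
print(auc(pts))  # perfect ranking gives AUC = 1.0
```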

CHAPTER 11 LOGISTIC REGRESSION.pptx

Editor's Notes

  • #17 Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to weigh larger values higher and smaller values lower, regardless of the units of the values. Example: an algorithm that does not use feature scaling can consider the value 3000 (meters) to be greater than 5 (km), which is not true, and in this case the algorithm will give wrong predictions. So we use feature scaling to bring all values to comparable magnitudes.
    Min-max normalization: rescales a feature or observation to values between 0 and 1:
    X_new = (X_i − min(X)) / (max(X) − min(X))
    Standardization: a very effective technique that rescales a feature so that it has a distribution with mean 0 and variance 1:
    X_new = (X_i − X_mean) / (standard deviation)
    Need for feature scaling: the given dataset contains 3 features – Age, Salary, BHK Apartment. Consider a range of 10–60 for Age, 1 lakh–40 lakhs for Salary, and 1–5 for the BHK of a flat. All these features are independent of each other. Suppose the centroid of class 1 is [40, 22 lakhs, 3] and the data point to be predicted is [57, 33 lakhs, 2]. Using the Manhattan method, Distance = |40 − 57| + |2,200,000 − 3,300,000| + |3 − 2|. It can be seen that the Salary feature will dominate all the other features when predicting the class of the given data point, even though the features are independent of each other, i.e. a person’s salary has no relation to his/her age or flat requirement. This means the model will predict wrongly. The simple solution to this problem is feature scaling: feature-scaling algorithms rescale Age, Salary, and BHK to a fixed range, say [−1, 1] or [0, 1], so that no feature can dominate the others.
  • #23 An individual with a balance of $1,400, an income of $2,000, and a student status of “Yes” has a probability of defaulting of 0.0273. Conversely, an individual with the same balance and income but with a student status of “No” has a probability of defaulting of 0.0439.
  • #24 This tells us that the optimal probability cutoff to use is 0.5451712. Thus, any individual with a probability of defaulting of 0.5451712 or higher will be predicted to default, while any individual with a probability less than this number will be predicted to not default.
  • #25 We can also calculate the sensitivity (also known as the “true positive rate”) and specificity (also known as the “true negative rate”) along with the total misclassification error (which tells us the percentage of total incorrect classifications): The total misclassification error rate is 2.7% for this model. In general, the lower this rate the better the model is able to predict outcomes, so this particular model turns out to be very good at predicting whether an individual will default or not.
  • #26 Lastly, we can plot the ROC (Receiver Operating Characteristic) Curve which displays the percentage of true positives predicted by the model as the prediction probability cutoff is lowered from 1 to 0. The higher the AUC (area under the curve), the more accurately our model is able to predict outcomes: We can see that the AUC is 0.9131, which is quite high. This indicates that our model does a good job of predicting whether or not an individual will default.