Business Analytics – The Science of Data Driven Decision Making
Logistic Regression
U Dinesh Kumar
CLASSIFICATION PROBLEMS
• Classification problems are an important category of
problems in analytics in which the response variable (Y)
takes a discrete value.
• The primary objective is to predict the class of a
customer (or class probability) based on the values of
explanatory variables or predictors.
Classification Problems
Classification is an important category of problems in which
the decision maker would like to classify the
case/entity/customer into two or more groups.
Examples of Classification Problems:
• Customer profiling (customer segmentation)
• Customer churn
• Credit classification (low, medium, and high risk)
• Employee attrition
• Fraud detection (classifying a transaction as fraud/no-fraud)
• Stress levels
• Text classification (sentiment analysis)
• Outcome of any binomial or multinomial experiment
Challenging Classification Problems
• Ransomware
• Anomaly detection
• Image classification (medical devices, satellite images)
• Text classification
Logistic Regression - Supervised Learning Algorithm
Logistic Function (Sigmoidal Function)
Logistic regression is a method used to fit a regression model when
the response variable is binary.
Logistic regression uses a method known as maximum likelihood
estimation to find an equation of the following form (the logistic
function in log-odds form):

log[p(X) / (1 - p(X))] = β0 + β1X1 + β2X2 + … + βpXp

where:
Xj: the jth predictor variable (IV)
βj: the coefficient estimate for the jth predictor variable
p(X): the probability that the response variable equals 1, i.e. P(Y = 1 | X)
The right-hand side of the equation predicts the log odds of the
response variable (DV) taking on a value of 1.
Thus, when we fit a logistic regression model, we can use
the following equation to calculate the probability that a
given observation takes on a value of 1:

p(X) = e^(β0 + β1X1 + … + βpXp) / (1 + e^(β0 + β1X1 + … + βpXp))

We then use some probability threshold to classify the
observation as either 1 or 0. Observations with a probability
greater than or equal to 0.5 will be classified as “1” and all
other observations will be classified as “0.”
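The two steps above (probability from log odds, then thresholding) can be sketched in a few lines of Python. The coefficients and feature values below are hypothetical, chosen only for illustration:

```python
import math

def predict_probability(coeffs, intercept, x):
    """Convert the linear predictor (log odds) into a probability
    using the logistic (sigmoid) function."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coeffs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(p, threshold=0.5):
    """Apply the probability threshold: >= threshold -> 1, else 0."""
    return 1 if p >= threshold else 0

# Hypothetical coefficients and predictor values (not fitted to real data)
p = predict_probability([0.005, -0.00001], -10.0, [1400, 2000])
print(round(p, 4))   # a small probability, so classify() returns 0
print(classify(p))
```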
Note: Logistic regression borrows the predictive-modeling machinery of
regression, which is why it is called logistic regression; however, because
it is used to classify samples, it falls under the classification algorithms.
Logistic Function (Sigmoid Function)
• The sigmoid function is a mathematical function used to map predicted
values to probabilities.
• The output of logistic regression must lie between 0 and 1 and cannot
go beyond these limits, so it forms an "S"-shaped curve.
• This S-shaped curve is called the sigmoid function or the logistic
function.
• Threshold value: the probability cutoff used to decide between 0 and 1.
• Values above the threshold are classified as 1, and values below the
threshold are classified as 0.
Assumptions for logistic regression:
• The dependent variable must be categorical in nature.
• The independent variables should not exhibit multicollinearity.
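The S-shape and the 0-to-1 bounds described above are easy to verify numerically. A minimal sketch of the sigmoid function:

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# The characteristic S-shape: large negative inputs approach 0,
# zero maps to exactly 0.5, large positive inputs approach 1.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```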
Types of Logistic Regression
• Binomial: the dependent variable has only two possible
categories, such as 0 or 1, or Pass or Fail.
• Multinomial: the dependent variable has 3 or more possible
unordered categories, such as "cat", "dog", or "sheep".
• Ordinal: the dependent variable has 3 or more possible
ordered categories, such as "low", "medium", or "high".
Logistic Regression Model Development
1. Explore the data; derive and analyze descriptive statistics.
2. Pre-process the data – divide the data into training and validation data.
3. Define the functional form of the relationship.
4. Estimate the regression parameters.
5. Perform diagnostic tests.
6. If the model satisfies the diagnostic tests, stop; otherwise, revise the model and repeat the earlier steps.
Example 1
The “Default” dataset contains the following information about 10,000 individuals:
• default: indicates whether or not an individual defaulted.
• student: indicates whether or not an individual is a student.
• balance: average balance carried by an individual.
• income: income of the individual.
We will use student status, bank balance, and income to build a
logistic regression model that predicts the probability that a
given individual defaults.
Descriptive / Exploratory Statistics
Steps in Logistic Regression:
1. Data pre-processing steps:
• Pre-process/prepare the data so that it can be used efficiently in the code.
• Extract the dependent and independent variables from the given
dataset (IVs = _______, DV = ________).
• Split the dataset into a training set and a test set (75%, 25%).
• Feature scaling.
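The feature-scaling step can be illustrated with a library-free sketch of z-score standardization (one of the scaling techniques described in the notes; min-max normalization is the other common choice). The sample values are hypothetical:

```python
def standardize(values):
    """Z-score standardization: rescale a feature so it has
    mean 0 and variance 1."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical sample of the 'balance' feature
balance = [729.5, 817.2, 1073.5, 529.3, 785.7]
scaled = standardize(balance)
print([round(s, 3) for s in scaled])
```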
The scaled output is given below:
Create training and test samples: split the dataset into a training set to train the
model on and a test set to evaluate the model on.
2. Fitting logistic regression to the training set
a. The LogisticRegression class is loaded from the available library, a classifier
object is created, and the training dataset is used to fit the logistic
regression model.
b. Model fit diagnostics are checked to see whether the model is well
fitted to the training set.
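Under the hood, a library fitter finds the coefficients by maximum likelihood estimation. As a library-free sketch of that idea (not the library's actual implementation), the snippet below maximizes the log-likelihood of a one-predictor model by gradient ascent on toy, hypothetical data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit beta0 + beta1*x by gradient ascent on the log-likelihood:
    the gradient per observation is (y - p) and (y - p)*x."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)) / n
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Toy, hypothetical training data: larger x makes y = 1 more likely
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   0,   1,   1,   1,   1]
b0, b1 = fit_logistic(xs, ys)
print(b1 > 0)  # the fitted slope should be positive
```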
The coefficients in the output indicate the average change in the log odds of defaulting. For
example, a one-unit increase in balance is associated with an average increase of 0.005988 in
the log odds of defaulting.
The p-values in the output also give us an idea of how effective each predictor variable is at
predicting the probability of default:
p-value of student status: 0.0843; p-value of balance: <0.0000; p-value of income: 0.4304
We can see that balance and student status seem to be important predictors, since they have low
p-values, while income is not nearly as important.
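Because the coefficient is a change in log odds, exponentiating it gives the multiplicative effect on the odds themselves. Using the balance coefficient 0.005988 from the output above:

```python
import math

beta_balance = 0.005988  # balance coefficient reported on the slide

# A one-unit increase in balance multiplies the odds of default by e^beta
odds_multiplier_1 = math.exp(beta_balance)
# Over a 100-unit increase the effect compounds
odds_multiplier_100 = math.exp(beta_balance * 100)

print(round(odds_multiplier_1, 4))    # about 1.006 (a 0.6% increase in odds)
print(round(odds_multiplier_100, 2))  # about 1.82
```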
Assessing Model Fit:
There is no R² value for logistic regression. Instead, we compute
a metric known as McFadden’s R², which ranges from 0 to just
under 1. Values close to 0 indicate that the model has no
predictive power; in practice, values over 0.40 indicate that a
model fits the data very well.
VIF Values:
We can also calculate the VIF values of each variable in the
model to see whether multicollinearity is a problem. VIF values
above 5 indicate severe multicollinearity.
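McFadden's R² is computed from two log-likelihoods: the fitted model's and the intercept-only (null) model's. A small sketch with hypothetical log-likelihood values:

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R^2 = 1 - (log-likelihood of the fitted model /
    log-likelihood of the intercept-only model). Both log-likelihoods
    are negative; a better fit makes ll_model less negative."""
    return 1.0 - ll_model / ll_null

# Hypothetical log-likelihood values for illustration
print(round(mcfadden_r2(-400.0, -1500.0), 3))  # 0.733
```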
Step 4: Use the Model to Make Predictions
Use the model to make predictions about whether or not an individual will default based
on their student status, balance, and income.
We can use the following code to calculate the probability of default for every
individual in our test dataset:
Step 5: Model Diagnostics
Lastly, we can analyze how well our model performs on the test dataset.
Any individual in the test dataset with a probability of default greater
than 0.5 will be predicted (classified) as a defaulter. However, we can
also find the optimal probability cutoff that maximizes the accuracy of
our model.
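A simple stand-in for the optimal-cutoff routine the slides allude to is a grid search over candidate thresholds, keeping the one with the highest accuracy. The probabilities and labels below are hypothetical:

```python
def best_accuracy_cutoff(probs, actuals, grid=None):
    """Scan candidate thresholds and return the (threshold, accuracy)
    pair that maximizes classification accuracy."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    best_t, best_acc = 0.5, -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        acc = sum(p == a for p, a in zip(preds, actuals)) / len(actuals)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical predicted probabilities and true default labels
probs   = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.75]
actuals = [0,   0,   0,    1,   1,    1,   0,   1]
t, acc = best_accuracy_cutoff(probs, actuals)
print(t, acc)
```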
Using this threshold, we can create a confusion matrix which shows
our predictions compared to the actual defaults. From it we can compute
the sensitivity (also known as the “true positive rate”) and the
specificity (also known as the “true negative rate”).
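The confusion-matrix counts and both rates can be computed directly from predicted and actual labels. A minimal sketch on hypothetical data:

```python
def confusion_stats(preds, actuals):
    """Build the 2x2 confusion counts and derive sensitivity
    (TP / (TP + FN)) and specificity (TN / (TN + FP))."""
    tp = sum(p == 1 and a == 1 for p, a in zip(preds, actuals))
    tn = sum(p == 0 and a == 0 for p, a in zip(preds, actuals))
    fp = sum(p == 1 and a == 0 for p, a in zip(preds, actuals))
    fn = sum(p == 0 and a == 1 for p, a in zip(preds, actuals))
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

# Hypothetical predictions vs. actual defaults
stats = confusion_stats([1, 0, 1, 1, 0, 0, 1, 0],
                        [1, 0, 0, 1, 0, 1, 1, 0])
print(stats)
```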
#plot the ROC curve
plotROC(test$default, predicted)
The ROC (Receiver Operating Characteristic) curve displays the
percentage of true positives predicted by the model as the
prediction probability cutoff is lowered from 1 to 0.
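The sweep that `plotROC` performs, and the AUC summary mentioned in the notes, can be sketched without any library. The scores below are hypothetical and deliberately rank all positives above all negatives, so the AUC is 1.0:

```python
def roc_points(probs, actuals):
    """Sweep the cutoff from 1 down to 0 and record (FPR, TPR) pairs,
    which is exactly what an ROC curve plots."""
    pos = sum(actuals)
    neg = len(actuals) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(probs), reverse=True):
        tp = sum(p >= t and a == 1 for p, a in zip(probs, actuals))
        fp = sum(p >= t and a == 0 for p, a in zip(probs, actuals))
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores: a model that ranks every positive above every negative
probs   = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
actuals = [1,   1,   1,   0,   0,   0]
pts = roc_points(probs, actuals)
print(auc(pts))  # perfect ranking gives AUC = 1.0
```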

CHAPTER 11 LOGISTIC REGRESSION.pptx

Editor's Notes

  • #17 Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to weigh larger values higher and smaller values lower, regardless of the units of the values. Example: an algorithm that does not use feature scaling can consider the value 3000 (meters) to be greater than 5 (km), which is not true, and in this case the algorithm will give wrong predictions. So we use feature scaling to bring all values to comparable magnitudes.
    Min-max normalization: rescales a feature or observation to values between 0 and 1:
    X_new = (X_i − min(X)) / (max(X) − min(X))
    Standardization: a very effective technique that rescales a feature so that it has a distribution with mean 0 and variance 1:
    X_new = (X_i − X_mean) / (standard deviation)
    Need for feature scaling: the given dataset contains 3 features – Age, Salary, BHK Apartment. Consider a range of 10–60 for Age, 1 lakh–40 lakhs for Salary, and 1–5 for the BHK of a flat. All these features are independent of each other. Suppose the centroid of class 1 is [40, 22 lakhs, 3] and the data point to be predicted is [57, 33 lakhs, 2]. Using the Manhattan method, Distance = |40 − 57| + |2,200,000 − 3,300,000| + |3 − 2|. It can be seen that the Salary feature will dominate all the other features when predicting the class of the given data point, even though the features are independent of each other, i.e. a person’s salary has no relation to his/her age or flat requirement. This means the model will predict wrongly. The simple solution to this problem is feature scaling: feature-scaling algorithms rescale Age, Salary, and BHK to a fixed range, say [−1, 1] or [0, 1], so that no feature can dominate the others.
  • #23 An individual with a balance of $1,400, an income of $2,000, and a student status of “Yes” has a probability of defaulting of 0.0273. Conversely, an individual with the same balance and income but with a student status of “No” has a probability of defaulting of 0.0439.
  • #24 This tells us that the optimal probability cutoff to use is 0.5451712. Thus, any individual with a probability of defaulting of 0.5451712 or higher will be predicted to default, while any individual with a probability less than this number will be predicted to not default.
  • #25 We can also calculate the sensitivity (also known as the “true positive rate”) and specificity (also known as the “true negative rate”) along with the total misclassification error (which tells us the percentage of total incorrect classifications): The total misclassification error rate is 2.7% for this model. In general, the lower this rate the better the model is able to predict outcomes, so this particular model turns out to be very good at predicting whether an individual will default or not.
  • #26 Lastly, we can plot the ROC (Receiver Operating Characteristic) Curve which displays the percentage of true positives predicted by the model as the prediction probability cutoff is lowered from 1 to 0. The higher the AUC (area under the curve), the more accurately our model is able to predict outcomes: We can see that the AUC is 0.9131, which is quite high. This indicates that our model does a good job of predicting whether or not an individual will default.