2. Introduction
Logistic Regression is used to predict binary outcomes for a given set
of independent variables.
It is one of the algorithms used for classification, since the values
it predicts are categorical.
The name may be a little confusing because it
has "regression" in it, but it is actually used for
performing classification, as the output is
discrete rather than a continuous numerical value.
3. Explanation
Logistic Regression is a type of statistical model that is used to predict the
probability of a certain event happening. It works by taking input variables and
transforming them into a probability value between 0 and 1, where 0 represents a
low probability and 1 represents a high probability.
For example, imagine you want to predict whether someone will buy a product
based on their age and income. Logistic Regression would take these input
variables and use them to calculate the probability of the person buying the
product.
It's called "logistic" because the transformation of the input variables is done
using a mathematical function called the logistic function, which creates
an S-shaped curve.
Overall, Logistic Regression is a useful tool for making predictions and
understanding the relationship between variables in a dataset.
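As a minimal sketch of this transformation (the weights below are made-up numbers for the age-and-income example, not fitted values), the logistic function maps any real-valued score to a probability between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear score built from age and income (made-up weights).
age, income = 35, 60_000
z = -4.0 + 0.05 * age + 0.00003 * income   # linear combination of inputs
print(sigmoid(z))                          # probability of buying, about 0.39
```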
4. Example
Imagine it's been several years since you serviced your car.
One day you are wondering…
whether your car will break down in the near future or not.
This is a classification problem, as the answer will be
either "Yes" or "No".
[Plot: probability of breakdown vs. years since service]
As we can imagine, when the number of years since the service is
low, say 1, 2, or 3 years, the chance of the
car breaking down is very limited.
Here, the dependent variable's output is discrete.
5. Why not Linear Regression?
Take, for example,
a dataset of employee ratings along with
whether or not each employee got a promotion.
If we plot this with Yes or No outcomes
(encoding No as 0 and Yes as 1), the graph will
look like this.
In the graph, we can see that the output is either 0
or 1 with nothing in between, as the output is discrete in
this case.
Employee rating, on the other hand, is a continuous number,
so there is no issue plotting it on the x-axis.
[Plot: probability of getting promotion vs. employee rating]
6. Why not Linear Regression?
[Plot: linear regression line fitted to the binary data, probability of getting promotion vs. employee rating]
As you can see, the graph does not look right.
There would be a lot of error, and the RMSE would be very
high. Also, the fitted line produces values below 0 and above 1,
even though the output can never go beyond that range.
Therefore, instead of using linear regression, we
need to come up with something different. This is where the
logistic model comes into the picture.
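A minimal sketch of the problem, with made-up ratings and outcomes: fitting an ordinary least-squares line to 0/1 labels yields predictions outside the [0, 1] range, so they cannot be read as probabilities.

```python
import numpy as np

# Made-up data: employee ratings and promotion outcomes (0 = No, 1 = Yes).
rating = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
promoted = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares fit: promoted ~ slope * rating + intercept.
slope, intercept = np.polyfit(rating, promoted, deg=1)

# Predictions at the extremes fall outside [0, 1] -- not valid probabilities.
for r in (0.0, 9.0):
    print(r, slope * r + intercept)   # prints about -0.36 and about 1.36
```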
7. Odds of Success
To understand Logistic Regression, letโs talk about the odds of success.
Odds(θ) = (probability of an event happening) / (probability of an event not happening)
or, θ = p / (1 − p)
(probability of getting promotion / probability of not getting promotion)
The value of the odds ranges from 0 to ∞.
The value of the probability ranges from 0 to 1.
If p = 0, θ = 0/(1 − 0) = 0/1 = 0
If p = 1, θ = 1/(1 − 1) = 1/0 = ∞
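A quick worked check of these endpoints (a minimal sketch; the probabilities below are arbitrary):

```python
def odds(p):
    """Odds of success for a probability p in [0, 1)."""
    return p / (1.0 - p)

for p in (0.0, 0.2, 0.5, 0.8, 0.99):
    print(p, odds(p))   # grows from 0 toward infinity as p approaches 1
```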
11. Compare Linear Regression
and Logistic Regression
Linear Regression:
• Used to solve regression problems.
• The response variable is continuous in nature.
• It helps estimate the dependent variable when there is a
change in the independent variable.
• The fitted model is a straight line.
Logistic Regression:
• Used to solve classification problems.
• The response variable is categorical in nature.
• It helps calculate the probability of a particular event
taking place.
• The fitted model is an S-curve. (S = Sigmoid)
12. Compare Linear Regression
and Logistic Regression
Linear Regression example:
• Weather prediction: if we need to predict the temperature
of the coming week, the output is a continuous number.
Logistic Regression example:
• Weather prediction: if we are going to predict whether it
will rain tomorrow or not, the output is a discrete value;
the predictions will be either "Yes" or "No".
13. Compare Logistic Regression
and Classification
Logistic Regression:
Logistic regression is a statistical modeling technique used to
analyze and model the relationship between a dependent
variable (binary or dichotomous) and one or more
independent variables.
In logistic regression, the dependent variable is categorical
(i.e., it takes on a limited number of values), while the
predicted probability is continuous in nature.
The goal of logistic regression is to predict the probability of
an event occurring (i.e., the dependent variable taking a
certain value) based on the values of the independent
variables.
Classification:
Classification, on the other hand, is a
machine learning task that involves
assigning an input to one of several
predefined categories.
Classification can be thought of as a kind of
prediction problem, where the goal is to
predict the class or category of a given
input.
14. Applications of
Logistic Regression
1. Fraud Detection:
Here, the binary outcome
variable will be either "Detected" or
"Not Detected".
2. Disease Diagnosis:
Here, the outcome will be
either "Positive" or "Negative".
3. Emergency Detection:
Here, the binary outcome
variable will be either "Emergency" or
"Not Emergency".
4. Spam Filter:
Here, the outcome will be
either "Spam" or "Not Spam".
15. Logistic Regression
Assumptions
• Binary Outcome:
The dependent variable, also known as the outcome variable or response
variable, is binary in nature.
This means that it takes on one of two possible values, typically coded as
0 and 1, or as "success" and "failure", "yes" and "no", "true" and "false",
or some other binary coding.
The logistic regression model is designed to estimate the probability of
the "success" outcome as a function of one or more independent
variables, also known as predictors or covariates.
The logistic function, which transforms a linear combination of the
predictors into a probability between 0 and 1, is used to model the
relationship between the predictors and the outcome.
16. Logistic Regression
Assumptions
• Independence of errors:
Independence of errors or residuals is a critical assumption of logistic
regression.
This means that the error or residual term for each observation in the
dataset should not be related to the error or residual term for any other
observation.
Violation of this assumption can result in biased and inefficient
estimates of the logistic regression parameters, which can lead to
incorrect inferences and predictions.
One way to check for violation of the independence assumption is to
examine the residual plot, which should not show any discernible
patterns or trends over time, across groups, or as a function of the
predicted values.
If violations of independence are detected, this may indicate the need to
consider a different model or to account for correlation or clustering in
the data using more sophisticated methods, such as generalized
estimating equations or mixed-effects models.
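A minimal sketch of the residual-plot check described above (made-up data; the model is fitted with scikit-learn, and the response residuals y − p̂ are plotted against the predicted probabilities):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # made-up predictors
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # made-up binary outcome

model = LogisticRegression().fit(X, y)
p_hat = model.predict_proba(X)[:, 1]
residuals = y - p_hat                                  # response residuals

# No discernible pattern or trend should appear in this plot.
plt.scatter(p_hat, residuals, s=10)
plt.xlabel("Predicted probability")
plt.ylabel("Residual (y - p_hat)")
plt.show()
```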
17. Logistic Regression
Assumptions
• Linearity of the logit:
Linearity of the logit is a key assumption of logistic regression. This
assumption means that the relationship between the independent
variables and the log-odds of the outcome is linear. In other words, the
effect of the independent variables on the log-odds of the outcome is
constant across the range of the independent variables.
One way to check for linearity is to examine the relationship between
each independent variable and the log-odds of the outcome using a
scatterplot or other graphical method. If there is evidence of
non-linearity, such as a curve or a pattern in the plot, it may be necessary to
consider adding polynomial terms, interaction terms, or other nonlinear
transformations of the independent variables to the model. Alternatively,
if the relationship is complex, a different model may be more
appropriate, such as a generalized additive model or a machine learning
algorithm.
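One rough way to eyeball this check (a sketch with made-up data): bin a continuous predictor, compute the empirical log-odds of the outcome within each bin, and see whether the points fall roughly on a straight line.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=1000)            # made-up continuous predictor
p_true = 1 / (1 + np.exp(-(x - 5)))          # true log-odds are linear in x
y = rng.random(1000) < p_true                # made-up binary outcome

# Empirical log-odds per bin; these should grow roughly linearly with x.
bins = np.linspace(0, 10, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (x >= lo) & (x < hi)
    p = y[in_bin].mean()
    if 0 < p < 1:                            # skip bins with no variation
        print(f"x in [{lo:.0f}, {hi:.0f}): log-odds = {np.log(p / (1 - p)):+.2f}")
```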
18. Logistic Regression
Assumptions
• No Multicollinearity:
The assumption of no or low multicollinearity among the independent variables
is important in logistic regression. Multicollinearity refers to a situation where
two or more independent variables are highly correlated with each other, which
can lead to problems in the estimation of the model parameters and in the
interpretation of the results.
Multicollinearity can cause unstable and imprecise estimates of the logistic
regression parameters, and may make it difficult to identify which independent
variable(s) are driving the observed effects on the outcome variable. One way to
check for multicollinearity is to calculate the correlation matrix between the
independent variables and look for high correlations (i.e., correlations greater
than 0.7 or 0.8).
If high correlations are detected, several strategies can be used to address
multicollinearity, such as removing one of the correlated variables, combining
the variables into a single index or factor, or using regularization techniques like
ridge regression or lasso regression. It is important to resolve issues related to
multicollinearity in order to ensure accurate and reliable estimates of the logistic
regression parameters.
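A minimal sketch of the correlation-matrix check described above (the data and column names are made up; `years_employed` is deliberately built to be nearly collinear with `age`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
age = rng.normal(40, 10, size=500)
df = pd.DataFrame({
    "age": age,
    "years_employed": age - 22 + rng.normal(0, 2, size=500),  # ~collinear with age
    "rating": rng.uniform(1, 10, size=500),
})

# Flag predictor pairs above the 0.8 rule-of-thumb threshold.
corr = df.corr().abs()
for i in range(len(corr)):
    for j in range(i + 1, len(corr)):
        if corr.iloc[i, j] > 0.8:
            print(corr.index[i], corr.columns[j], round(corr.iloc[i, j], 2))
```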
19. Logistic Regression
Assumptions
• Large Sample Size:
Sample size is an important consideration in logistic regression. A relatively
large sample size is typically required to ensure stable estimates and
adequate statistical power to detect meaningful effects.
The sample size requirements for logistic regression depend on several
factors, such as the number and complexity of the independent variables, the
prevalence of the outcome in the population, and the desired level of
statistical power. As a general rule of thumb, a sample size of at least 10-15
observations per independent variable is often recommended (for example,
roughly 50-75 observations for a model with five predictors).
If the sample size is too small, the logistic regression model may suffer from
issues such as overfitting, where the model fits the noise in the data instead
of the underlying signal, and underpowered statistical tests, where important
effects may be missed due to insufficient sample size.
In summary, a relatively large sample size is important for logistic regression
to ensure accurate and stable estimates, as well as adequate statistical power
to detect meaningful effects.
20. Confusion Matrix
• A confusion matrix is a table used to evaluate the performance of a machine
learning algorithm for classification tasks. It is a square matrix that compares the
actual and predicted values of a classifier.
• Let's consider an example of a binary classification problem where we have a
dataset of 100 patients, and we want to build a model that can
predict whether a patient has diabetes or not based on their medical data. The
model output will be either "Positive" or "Negative".
• By examining the values in the confusion matrix, we can calculate various
performance metrics, such as accuracy, precision, recall, and F1-score, which can
help us evaluate the model's performance. The confusion matrix provides a clear
and concise way of visualizing the model's performance in terms of its ability to
correctly classify positive and negative cases.
21. Confusion Matrix
Suppose the model has made predictions on the test set and
we have the following results:

                 Predicted Positive   Predicted Negative
Actual Positive          60                   10
Actual Negative          15                   15

Here, we have a 2x2 matrix, where the rows represent the
actual values and the columns represent the predicted
values. The diagonal elements of the matrix represent the
correctly classified cases, and the off-diagonal elements
represent the incorrectly classified cases.
• The values in the confusion matrix are as follows:
- True Positives (TP): the number of cases that were
correctly classified as positive (60 in this case).
- False Positives (FP): the number of cases that were
incorrectly classified as positive (15 in this case).
- True Negatives (TN): the number of cases that were
correctly classified as negative (15 in this case).
- False Negatives (FN): the number of cases that were
incorrectly classified as negative (10 in this case).
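As a minimal sketch, the metrics mentioned above can be computed directly from these counts, using the TP = 60, FP = 15, TN = 15, FN = 10 values from this example:

```python
# Counts taken from the confusion matrix above.
tp, fp, tn, fn = 60, 15, 15, 10

accuracy = (tp + tn) / (tp + fp + tn + fn)           # 75 / 100 = 0.75
precision = tp / (tp + fp)                           # 60 / 75  = 0.80
recall = tp / (tp + fn)                              # 60 / 70  ~ 0.857
f1 = 2 * precision * recall / (precision + recall)   # ~ 0.828

print(accuracy, precision, recall, f1)
```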