Regression
Agenda for Today’s Session
WHAT IS REGRESSION
We will understand Regression and move on to its types.
REGRESSION DIAGNOSTICS
Let's find out what works best with Linear Regression.
2
LINEAR REGRESSION vs LOGISTIC
Let's find out the difference between Linear & Logistic Regression.
UNDERSTANDING LINEAR REGRESSION
ALGORITHM ASSUMPTIONS
Let's look into the Assumptions & Violations.
UNDERSTANDING LOGISTIC REGRESSION
Let's look into the Algorithm & understand how it functions.
INTERVIEW QUESTIONS
Covering some interview questions to strengthen the knowledge and give an idea about interviews too.
Let's Dive In
3
What is
Regression
4
Understanding Linear Regression Algorithm
5
Establishes a relationship between the Independent
& Dependent Variables.
Examples of Independent & Dependent Variables:-
• x is Rainfall and y is Crop Yield
• x is Advertising Expense and y is Sales
• x is sales of goods and y is GDP
Here x is the independent variable & y is the dependent variable.
Intro
How it Works
• Regression analysis is used to understand which of the independent variables are related to the dependent variable.
• It attempts to model the relationship between two variables by fitting a line called the Linear Regression Line.
• The case of a single independent variable is called Simple Linear Regression, whereas the case of multiple independent variables is called Multiple Linear Regression.
6
Simple Linear Regression Vs Multiple Linear Regression
The Linear Regression line is created using Ordinary Least Square Method.
[Diagram: Simple Linear Regression relates a single predictor X to Y; Multiple Linear Regression relates multiple predictors X1, X2, X3, X4 to Y.]
Linear Regression Equation
7
y = mx + c, where m is the slope/gradient and c is the y-intercept.
Sum of Squared Error
What is Error?
• Actual Value – Predicted Value is called Error
• Here Predicted Value is the value predicted by
the Linear Regression Model.
• Also known as Residual.
Why is it Important?
The smaller the residuals, the more accurate the model.
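In code, the residuals and their sum of squares are a one-liner each; a tiny sketch with hypothetical arrays:

```python
import numpy as np

y_actual = np.array([3.0, 10.0, 3.0, 6.0])   # observed values (hypothetical)
y_pred   = np.array([4.1,  9.2, 3.5, 5.8])   # values predicted by the model

residuals = y_actual - y_pred                # error = actual - predicted
sse = np.sum(residuals ** 2)                 # Sum of Squared Error
print(residuals, sse)
```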
Regression Line | Best Fit Line
What is the Line of Best Fit?
• The Best Fit Line is the line that gives the minimum SSE.
• Amongst all the possible lines, there will be one line that will be the best fit
meaning greatest possible accuracy as a model.
• The line that minimizes the sum of squared error of residuals is called
Regression Line or the Best Fit Line.
• In simple terms, it represents a straight line that best represents the data on the scatterplot. It is drawn using the Least Squares Method.
Use the Least Square Method to Determine the Eqn. of Line of Best Fit
x: 8, 2, 11, 6, 5, 4, 12, 9, 6, 1
y: 3, 10, 3, 6, 8, 12, 1, 4, 9, 14
Finding Best Fit Line Algorithm
How to find the Best Fit Line?
• The equation of the straight line is given by y = mx + c
m – slope of the line
c – intercept (the point at which the straight line crosses the y-axis).
• The Best Fit Line is found using the Least Squares Method.
Algorithm:
Step 1: Find the mean of the x-values and the y-values.
Step 2: Calculate the slope of the line:
m = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)²
Step 3: Compute the y-intercept of the line using the formula:
c = ȳ – m·x̄
Regression Line | Best Fit Line
Use the Least Square Method to Determine the Eqn. of Line of Best Fit
Steps: The following steps are deployed to achieve the objective.
Step 1: Calculate the means of the x and y values.
Step 2: Find Σ(x – x̄)(y – ȳ) and Σ(x – x̄)².
Regression Line | Best Fit Line
Step1: Calculate the Mean of X and Ys
The mean value of x is 6.4 and the mean value of y is 7.
Step 2: Find m (slope)
m (slope) = –131 / 118.4 ≈ –1.1
Step 3: Calculate the y-intercept
c (intercept) = 7 – (–1.1) × 6.4 ≈ 14.0
Thus, the equation of the line is y = -1.1x + 14
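The same arithmetic can be reproduced in a few lines of Python (a sketch using the x/y table from the earlier slide):

```python
import numpy as np

x = np.array([8, 2, 11, 6, 5, 4, 12, 9, 6, 1], dtype=float)
y = np.array([3, 10, 3, 6, 8, 12, 1, 4, 9, 14], dtype=float)

x_mean, y_mean = x.mean(), y.mean()   # 6.4 and 7.0
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean               # intercept from Step 3
print(m, c)                           # -> approx -1.1 and 14.0
```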
Regression Equation and Line of Best Fit
R Squared
1. R Squared is a statistical measure that represents the proportion of the variance in the dependent variable (DV) that is explained by the independent variable(s) (IV).
2. While correlation defines the strength of the relationship, R Squared explains to what extent the variance of one variable explains the variance of another.
3. Example – In investing, R Squared is the percentage of a fund's movements that can be explained by movements of a benchmark (e.g., the Sensex).
4. Also known as the Coefficient of Determination.
Coefficient of Determination
How to see if the Assumptions are Violated – Deciding if Linear Model
is a good Fit
16
Residual vs Fitted Values Plot
1. The x-axis has the fitted values and the y-axis has the residuals.
2. Residual = observed y value – predicted y value.
3. The vertical distance of an actual point from the line of best fit is called the residual.
4. If you are unsure about the shape (curvature) of the regression equation from the scatterplot, a residual plot helps in making the decision.
When a pattern is observed in a residual plot,
a linear regression model is probably not appropriate for your data.
Residuals should be randomly scattered around the line y = 0.
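A minimal matplotlib sketch of such a plot, assuming a statsmodels OLS fit on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid)  # fitted values on x, residuals on y
plt.axhline(0, linestyle="--")                # residuals should scatter around 0
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```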
Normal Q-Q Plot (Quantile Quantile Plot)
17
1. If the data is normally distributed, the points in the QQ-normal plot lie on a straight
diagonal line.
2. The greater the departure from this reference line, the greater the evidence that the data does not follow a normal distribution.
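A quick way to draw one, assuming scipy is available (synthetic normal data stands in for model residuals):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.default_rng(2).normal(size=200)  # replace with model residuals
stats.probplot(data, dist="norm", plot=plt)       # points near the line => normal
plt.show()
```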
Interview Questions
18
01. Which of the following is true about Residuals?
A) Lower is Better
B) Higher is Better
C) A or B depends on the situation.
D) None of these
Interview Questions - Solution
19
01. Which of the following is true about Residuals?
A) Lower is Better
B) Higher is Better
C) A or B depends on the situation.
D) None of these
Solution A: Residuals refer to the error value of the model; hence, lower is better. (Residual = y – yhat)
Interview Questions
20
02. Which of the statements is true regarding residuals in regression?
A) Mean of the Residuals is always zero.
B) Mean of the Residuals is always less than zero.
C) Mean of the Residuals is always more than zero.
D) There is no such rule for residuals.
Interview Questions - Solution
21
02. Which of the statements is true regarding residuals in regression?
A) Mean of the Residuals is always zero.
B) Mean of the Residuals is always less than zero.
C) Mean of the Residuals is always more than zero.
D) There is no such rule for residuals.
Solution: A
The sum of residuals in regression is always zero. If the sum of residuals is zero, the mean will also be zero.
Interview Questions
22
03. To test a linear relationship between a continuous dependent variable y and a continuous independent variable x, which of the following plots is best suited?
A) Scatterplot
B) Barplot
C) Histograms
D) None of These.
Interview Questions - Solution
23
03. To test a linear relationship between a continuous dependent variable y and a continuous independent variable x, which of the following plots is best suited?
A) Scatterplot
B) Barplot
C) Histograms
D) None of These.
Solution: A
To test the linear relationship between continuous variables, a scatter plot is a good option. We can see how one variable changes w.r.t. another. A scatter plot displays the relationship between two quantitative variables.
Interview Questions
24
04. The correlation between the age and health of a person was found to be -1.09. On the basis of this, you would tell the doctors that:
A) Age is a good predictor of health
B) Age is a poor predictor of health
C) None of These.
Interview Questions - Solution
25
04. The correlation between the age and health of a person was found to be -1.09. On the basis of this, you would tell the doctors that:
A) Age is a good predictor of health
B) Age is a poor predictor of health
C) None of These.
Solution: C
The correlation coefficient always lies in [-1, 1], so -1.09 is not possible.
Interview Questions
26
05. Which of the following offsets, do we use in case of least square line fit?
Suppose horizontal axis is independent variable and vertical axis is dependent variable.
A) Vertical Offset
B) Perpendicular Offset
C) Both
D) None of the Above.
Interview Questions - Solution
27
05. Which of the following offsets, do we use in case of least square line fit?
Suppose horizontal axis is independent variable and vertical axis is dependent variable.
A) Vertical Offset
B) Perpendicular Offset
C) Both
D) None of the Above.
Solution: A
We always consider residuals as vertical offsets. Perpendicular offsets are used in the case of PCA.
28
Linear Regression – Model Assumptions
Linear Regression assesses whether one or more predictor variables explain the dependent variable, and it rests on five assumptions:
1. Linear Relationship
2. Normality
3. No or Little Multicollinearity
4. No Auto Correlation in errors
5. Homoscedasticity
Note on sample size – A common rule of thumb is that regression analysis requires at least 20 data points per independent variable.
29
1. Check for Linearity
1. Linear Regression needs the relationship between the independent & dependent variables to be linear & additive.
2. Additive means the effect of x on y is independent of the other variables.
3. Linearity can be checked using scatter plots.
4. Some examples are shown on the right; they show little to no correlation.
Transforming Variables to Achieve Linearity
30
• Each Row shows a different transformation method.
• The Transform column shows the method of transformation to be applied to the DV or IV.
• The Regression equation column shows the equation used in the analysis.
• The last column shows the prediction equation.
Non Linear to Linear Conversion
31
The best transformation depends on the data, & the best model will give the highest coefficient of determination.
Steps Involved:-
1. Create Linear Regression Model.
2. Construct a residual plot
3. If the plot is random, don’t transform the data.
4. Compute the Coefficient of Determination (R2)
5. Choose a transformation method, as listed in the table on the previous slide.
6. Transform IV or DV or both.
7. Apply Regression
8. If the Transformed R2 is greater than the previous score, the transformation is a
success.
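A sketch of this workflow for one common case, a log transform of the dependent variable; the data here is synthetic and exponential by construction, so the numbers are illustrative only:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 100)
y = np.exp(0.5 * x) * rng.lognormal(0, 0.1, 100)  # exponential relationship

X = sm.add_constant(x)
raw = sm.OLS(y, X).fit()             # Steps 1-4: linear model and its R2
logged = sm.OLS(np.log(y), X).fit()  # Steps 5-7: transform the DV, refit

print(raw.rsquared, logged.rsquared) # Step 8: the transformation succeeds
                                     # if the second R2 is higher
```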
Transformation Example
32
Objectives:
1. Create Linear Regression Model in Excel or R.
2. Find the Linear Regression Equation
3. Make Predictions
4. Create a Residual Plot and find whether Linear Regression is a good fit or not.
5. If not, try a transformation.
[Scatterplot: Y vs X]
33
2. Check for Normality
1. Linear Regression requires all the variables to be normally distributed.
2. Normality can be checked using histograms or Q-Q plots.
3. Formal tests for normality (goodness-of-fit tests) include the Kolmogorov-Smirnov test and the Shapiro-Wilk test.
4. If the data is not normal, a non-linear transformation (e.g. a log transformation) can fix the issue.
5. Normality means that the Y values are normally distributed for each X.
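A minimal sketch of the Shapiro-Wilk test (synthetic data standing in for the variable or residuals being checked):

```python
import numpy as np
from scipy.stats import shapiro

data = np.random.default_rng(4).normal(size=100)  # e.g. model residuals
stat, p = shapiro(data)
print(p)   # p > 0.05: no evidence against normality (fail to reject H0)
```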
34
3. Check for Multicollinearity
1. It means that the predictors are correlated with each other. The presence of correlation among the independent variables leads to multicollinearity.
2. What happens if variables are correlated – it becomes difficult for the model to determine the true effect of the independent variables on the dependent variable.
3. Multicollinearity is measured by the VIF (Variance Inflation Factor):
1. VIF tells us, if the predictors are correlated, how much the variance of an estimated coefficient increases. If no factors are correlated, the VIF will be 1.
2. VIF = 1 – no multicollinearity.
3. VIF > 1 – the predictors may be correlated.
4. VIF between 5 & 10 – indicates high correlation.
5. VIF > 10 – regression coefficients are poorly estimated due to multicollinearity.
Solution:
1. Drop the variable showing high collinearity. The presence of collinearity suggests that the information this variable provides about the DV is redundant and of no use.
2. Another approach is to combine the collinear variables and create a new predictor (e.g. by taking their average).
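A short VIF sketch using statsmodels, with two deliberately collinear synthetic predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)                  # independent

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):             # skip the constant column
    print(f"VIF x{i}: {variance_inflation_factor(X, i):.1f}")
# x1 and x2 show very large VIFs; x3 stays near 1
```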
Plots Showing Heteroscedasticity
35
4. Heteroscedasticity
37
Image Source: Google
Heteroscedasticity means the data has different dispersion – in other words, unequal scatter of the residuals.
Why it is a Problem
It is a problem because OLS regression assumes that all residuals are drawn from a population with constant variance (homoscedasticity). Heteroscedasticity distorts the standard errors and significance tests and makes the coefficient estimates inefficient.
Heteroscedasticity, also spelled heteroskedasticity, occurs more often in datasets that have a large range between the largest and smallest observed values.
A classic example of heteroscedasticity: if you model household consumption based on income, you'll find that the variability in consumption increases as income increases.
Breusch-Pagan / Cook-Weisberg Test – This test is used to determine the presence of heteroskedasticity. If you find p < 0.05, you reject the null hypothesis and infer that heteroskedasticity is present.
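A minimal Breusch-Pagan sketch with statsmodels, on synthetic data whose noise grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(0, x, 200)  # noise grows with x: heteroscedastic

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(res.resid, X)
print(lm_p)  # p < 0.05 => reject H0, heteroskedasticity is present
```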
Scale Location Plot
38
• This plot is also useful for detecting heteroskedasticity. Ideally, it shouldn't show any pattern.
• The presence of a pattern indicates heteroskedasticity.
• Don't forget to corroborate the findings of this plot with the funnel shape in the residuals vs. fitted values plot.
Leverage Plot
39
Leverage: an observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. High-leverage points can have a large effect on the estimates of the regression coefficients.
Summary of Tests in Python for Linear Regression Assumptions
40
Multicollinearity Test – Variance Inflation Factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

Normality Test – Shapiro-Wilk Test, Jarque-Bera Test
from scipy.stats import shapiro

Autocorrelation Test – Durbin-Watson Test, Breusch-Godfrey Test

Heteroscedasticity Test – Goldfeld-Quandt Test, Breusch-Pagan Test
import statsmodels.stats.api as sts

Non-Linearity Test – Linear Rainbow Test
import statsmodels.stats.api as sts
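The entries above each map to one function call; a brief sketch for two of the less common ones, the Jarque-Bera and Rainbow tests, on a synthetic OLS fit:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import linear_rainbow

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)
res = sm.OLS(y, sm.add_constant(x)).fit()

jb_stat, jb_p, skew, kurtosis = jarque_bera(res.resid)  # normality of residuals
rb_stat, rb_p = linear_rainbow(res)                     # H0: relationship is linear
print(jb_p, rb_p)
```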
Auto Correlation of Residuals
41
Autocorrelation of errors means that the errors are correlated with one another across observations.
The assumption is that the Linear Regression Model residuals are not correlated.
Test of Assumption – Durbin Watson Test
Package - statsmodels.stats.stattools.durbin_watson(resids, axis=0)
What is the Null Hypothesis
The Null Hypothesis of the test is that there is no serial correlation.
Statistics (Always between 0 and 4)
• The test statistic is equal to 2(1 – r), where r is the sample autocorrelation of the residuals.
• Thus, r = 0 (no serial correlation) gives a test statistic of 2.
• The closer the statistic is to 0, the more evidence of positive serial correlation; the closer to 4, the more evidence of negative serial correlation.
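A minimal sketch of the test on a synthetic fit (independent errors by construction, so the statistic should come out near 2):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)  # independent errors
res = sm.OLS(y, sm.add_constant(x)).fit()

print(durbin_watson(res.resid))        # approx 2 => no serial correlation
```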
Assessing Goodness of Fit - R2
42
After fitting the model, it becomes essential to understand how well the model fits the data.
When Does the Model Fit the Data Best?
A Model fits the data well if the difference between the actual value and the model’s predicted value is
small and unbiased.
What is R-Squared (R2)?
It is a statistical measure of how close the data is to the fitted regression line. The definition of R-squared is fairly straightforward; it is the percentage of the response-variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
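The ratio can be computed by hand; a tiny sketch with hypothetical actual/predicted arrays:

```python
import numpy as np

y      = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([3.2, 4.8, 7.1, 8.9])

ss_res = np.sum((y - y_pred) ** 2)    # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
r2 = 1 - ss_res / ss_tot              # = explained / total
print(r2)                             # close to 1 => good fit
```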
Interview Questions
43
01. True-False: Linear Regression is a supervised machine learning algorithm.
A) TRUE
B) FALSE
Interview Questions - Solution
44
01. True-False: Linear Regression is a supervised machine learning algorithm.
A) TRUE
B) FALSE
Solution A: Yes, Linear Regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm should have an input variable (x) and an output variable (Y) for each example.
Interview Questions
45
02. Which of the following methods do we use to find the best fit line for data in Linear
Regression?
A) Least Square Method
B) Maximum Likelihood
C) Both A and B
Interview Questions - Solution
46
02. Which of the following methods do we use to find the best fit line for data in Linear
Regression?
A) Least Square Method
B) Maximum Likelihood
C) Both A and B
Solution - A: In Linear Regression, we use the Least Square Method to identify the
Best Fit Line.
Interview Questions
47
03. Which of the following evaluation metrics can be used to evaluate a model while
modelling a continuous output variable?
A) AUC - ROC
B) Accuracy
C) Mean Squared Error
Interview Questions - Solution
48
03. Which of the following evaluation metrics can be used to evaluate a model while
modelling a continuous output variable?
A) AUC - ROC
B) Accuracy
C) Mean Squared Error
Solution C: Since Linear Regression gives continuous output values, we use the Mean Squared Error metric to evaluate model performance. The other options are used for classification problems.
Interview Questions
49
04. Which of the following statements is true about outliers in Linear Regression?
A) Linear Regression is sensitive to outliers
B) Linear Regression is not sensitive to outliers
C) No Idea.
D) None of these
Interview Questions - Solution
50
04. Which of the following statements is true about outliers in Linear Regression?
A) Linear Regression is sensitive to outliers
B) Linear Regression is not sensitive to outliers
C) No Idea.
D) None of these
Solution A: The slope of the regression line will change if outliers are present in the data. Therefore, Linear Regression is sensitive to outliers.
Linear Regression Example
51
Objectives:
1. Create Linear Regression Model in Excel or R.
2. Find the Linear Regression Equation
3. Make Predictions
4. Create a Residual Plot and find whether Linear Regression is a good fit or not
x: 1, 2, 3, 4, 5
y: 2, 1, 3.5, 3, 4.5
Fitted line: y = 0.7x + 0.7
[Scatterplot: X & Y with the fitted line y = 0.7x + 0.7]
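For reference, a one-line least-squares fit of this example's data (np.polyfit), which recovers the slide's equation:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 3.5, 3, 4.5])

m, c = np.polyfit(x, y, deg=1)  # least-squares line of degree 1
print(m, c)                     # -> 0.7 and 0.7, i.e. y = 0.7x + 0.7
```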
Assessing Model Fit
52
Residuals
The distance of an actual point from the line of prediction.
Root Mean Squared Error (RMSE)
Reported as "Residual Standard Error" in linear model output. It is interpreted as how far, on average, the residuals are from zero.
Mean Absolute Error (MAE)
Mean Absolute Error is another metric to evaluate the model. For example, if the actual y is 10 and the predicted y is 30, the resulting absolute error is (30 – 10) = 20. MAE is more robust to the effect of outliers than RMSE.
R Square
This metric explains the percentage of variance explained by the model. It ranges between 0 and 1; a higher value is generally better.
Adjusted R Square
Since R Square increases as new variables are introduced, regardless of whether the new variable actually adds information to the model, we look to Adjusted R Square, which only increases or decreases if the newly added variable is truly useful.
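A compact sketch computing all four metrics by hand (hypothetical arrays; k is the number of predictors):

```python
import numpy as np

y      = np.array([10.0, 12.0, 15.0, 20.0, 25.0])
y_pred = np.array([11.0, 11.5, 16.0, 19.0, 26.0])
n, k = len(y), 1                                 # n observations, k predictors

rmse = np.sqrt(np.mean((y - y_pred) ** 2))
mae  = np.mean(np.abs(y - y_pred))
r2   = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalises extra predictors
print(rmse, mae, r2, adj_r2)
```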
Difference Between Linear and Logistic Regression
53
Linear Regression | Logistic Regression
The data is modelled using a straight line | A statistical model that predicts the probability of an outcome that can have two values
The outcome (dependent variable) is continuous in nature | The outcome (dependent variable) has only a limited number of possible values
Output variable is continuous | Output variable is discrete
Used to solve regression problems | Used to solve classification problems (binary classification)
Estimates the dependent variable when there is a change in the independent variable | Calculates the probability of occurrence of an event
Uses the ordinary least squares method to minimise the errors and arrive at the best possible fit | Uses the maximum likelihood method to arrive at the solution
Uses a straight line | Uses an S-curve (sigmoid function)
Example – predicting sales, house prices, GDP, etc. | Example – predicting whether an email is spam, whether a credit card transaction is fraudulent, or whether a customer will buy the product
Logistic
Regression
54
Logistic Regression
55
• Logistic Regression is used when the dependent variable is categorical.
• The predicted values are strictly in the range of 0 to 1.
• It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Logistic Regression Equation
56
Fundamental equation of the generalized linear model:
g(E(y)) = α + βx1 + γx2
• Here g() is the link function. The role of the link function is to "link" the expectation of y to the linear predictor.
• E(y) is the expectation of the target variable.
• α + βx1 + γx2 is the linear predictor (α, β, γ are to be estimated).
Let's take a simple linear regression equation with the dependent variable enclosed in a link function:
g(y) = βo + β(Age)
Here the g() function is trying to establish the probability of success (p) or the probability of failure (1 – p).
Criteria for p:
• It must always be positive (p >= 0).
• It must always be less than or equal to 1 (p <= 1).
Since the probability must always be positive, we put the linear equation in exponential form:
p = exp(βo + β(Age))
For any value of the slope and the independent variable, the exponent of this equation will never be negative.
Logistic Regression Equation
57
p = exp(βo + β(Age))
In order to make the probability less than 1, we must divide p by a number greater than p:
p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1)
Writing y = βo + β(Age), this gives a new equation:
p = e^y / (1 + e^y)
This is the logistic function.
The probability of success is given by p = e^y / (1 + e^y), and the probability of failure by
q = 1 – p = 1 / (1 + e^y)
On dividing the two equations, we get p / (1 – p) = e^y, and taking logs:
log(p / (1 – p)) = y = βo + β(Age)
Logistic Regression Equation
58
Final Equation:
log(p / (1 – p)) = βo + β(Age)
log(p / (1 – p)) is the link function (the logit), and p / (1 – p) is the odds. If the log of the odds is positive, the probability of success is more than 50%.
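A minimal sketch of fitting this model with statsmodels Logit on synthetic age data (the coefficients -5 and 0.1 are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
age = rng.uniform(20, 80, 500)
p_true = 1 / (1 + np.exp(-(-5 + 0.1 * age)))  # true logistic relationship
y = rng.binomial(1, p_true)                   # binary outcome

model = sm.Logit(y, sm.add_constant(age)).fit()
print(model.params)                           # estimates of (bo, b)
# model.predict returns p = e^y / (1 + e^y) for each observation
```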
Sigmoid Function
59
• The Sigmoid Function, also called the Logistic Function, gives an S-shaped curve that takes any real value and maps it to a value between 0 and 1.
• The range of the output values is between 0 and 1.
• If the output of the sigmoid function is more than 0.5, we classify the outcome as 1 or Yes.
• If the output of the sigmoid function is less than 0.5, we classify the outcome as 0 or No.
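A tiny sketch of the sigmoid and the 0.5 threshold rule:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)
labels = (p >= 0.5).astype(int)  # classify as 1/Yes at or above 0.5
print(p, labels)
```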
ROC Curve
60
• The ROC is a probability curve, and the AUC represents the degree or measure of separability.
• The higher the AUC, the better the model.
• Here TPR is the True Positive Rate, aka Recall or Sensitivity, given by TP / (TP + FN).
• Specificity = TN / (TN + FP), and the FPR (False Positive Rate) is 1 – Specificity.
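A minimal sketch with scikit-learn (hypothetical labels and predicted probabilities):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])  # predicted probs

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FPR = 1 - specificity
print(roc_auc_score(y_true, y_score))              # higher AUC => better separability
```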
What is a Confusion Matrix
Image Source: Google
[Confusion matrix: actual values (Y) as rows, predicted values (Ŷ) as columns]
Let's say we are predicting the presence of a disease: "yes" means they have the disease and "no" means they don't.
1. The classifier made a total of 165 predictions.
2. Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
3. In reality, 105 patients in the sample have the disease, and 60 patients do not.
Let's understand the basic terms:
• True Positives (TP): predicted yes, and they do have the disease.
• True Negatives (TN): predicted no, and they don't have the disease.
• False Positives (FP): predicted yes, but they don't actually have the disease.
• False Negatives (FN): predicted no, but they actually do have the disease.
Confusion Matrix
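A minimal sketch with scikit-learn on hypothetical labels, recovering the four counts and the accuracy formula used on the next slide:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # actual: disease yes/no
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])  # classifier's predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)
```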
Performance of Logistic Regression Model
62
1. AIC (Akaike Information Criterion) – the analogous metric to adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.
2. Confusion Matrix: it is the tabular representation of actual vs predicted values. It helps us find the accuracy of the model. The accuracy is calculated using the following equation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
3. ROC Curve:
• The ROC Curve is known as Receiver Operating Characteristic Curve. It evaluates
the trade off between true positive rate and false positive rate.
• It is common to use p > 0.5 as the threshold value, since we are usually more concerned about the success class of the model.
• The higher the area under the curve, the better the predictive power of the model.
Logistic Regression Assumptions
63
• Logistic Regression does not need a linear relationship between the dependent and independent variables.
• The errors (residuals) need not be normally distributed.
• There should be little to no multicollinearity among the independent variables.
• The outcome is a binary variable, like yes or no, 1 or 0, positive or negative, etc.
• For a binary regression, factor level 1 of the dependent variable should represent the desired outcome.
• There is a linear relationship between the logit of the outcome and each predictor variable. Recall that the logit function is logit(p) = log(p / (1 – p)), where p is the probability of the outcome.
• Logistic Regression requires quite large sample sizes.
Forward &
Backward
Selection
64
Forward Selection
65
1. Its a process which begins with empty model and keeps adding variables one by one.
2. Begins with Intercept. Tests are performed to find the relevant variables as ‘best’ variables.
3. The best variable shall return the highest coefficient of Determination or R-Squared Value.
4. This process keeps going and once the model no longer improves the accuracy by adding more
variables, the process stops.
5. Several Criterions are used to determine which variable goes in – lowest RMSE on cross
validation, F Test Score or lowest P Value.
Tip: While using Forward Selection, in order to test the accuracy of the model, its better to use the trained
Classifier against test data to make predictions.
Backward Selection
66
1. It's a process that begins with all variables and keeps removing predictors one by one.
2. Remove the variable with the largest p-value, i.e. the variable that is least significant.
3. The new (p – 1)-variable model, with the largest-p-value variable removed, is a better model.
4. This keeps going, and once every remaining variable meets the defined significance level, we stop.
5. Several criteria are used to determine which variable goes out – largest p-value, lowest RMSE on cross-validation, or F-test score. A code sketch for both directions follows below.
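Both directions are available in scikit-learn's SequentialFeatureSelector; note that it selects by cross-validated score rather than p-values, so it matches the "lowest RMSE on cross-validation" criterion above. A sketch on a built-in dataset:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5
).fit(X, y)   # starts empty, adds the best variable each step

backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="backward", cv=5
).fit(X, y)   # starts full, drops the weakest variable each step

print(forward.get_support(), backward.get_support())
```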
Let’s review some concepts
• Linear Regression
• Assumptions of Linear Regression
• Diagnostic Plots
• Difference between Linear and Logistic Regression
• Logistic Regression
• Forward and Backward Elimination
67
Thanks!
Any questions?
You can find me on LinkedIn
@mkschauhan
mukul.mschauhan@gmail.com
68
https://www.linkedin.com/pulse/data-visualisation-using-seaborn-mukul-kr-singh-chauhan
https://www.linkedin.com/pulse/introduction-ggplot-series-mukul-kr-singh-chauhan/
https://www.linkedin.com/pulse/what-data-science-mukul-chauhan/
More Related Content

What's hot

Missing data handling
Missing data handlingMissing data handling
Missing data handling
QuantUniversity
 
Module 3: Linear Regression
Module 3:  Linear RegressionModule 3:  Linear Regression
Module 3: Linear Regression
Sara Hooker
 
Scatter plots
Scatter plotsScatter plots
Scatter plots
swartzje
 
Polynomial regression
Polynomial regressionPolynomial regression
Polynomial regression
naveedaliabad
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
saba khan
 
Decision tree
Decision treeDecision tree
Decision tree
R A Akerkar
 
Linear regression analysis
Linear regression analysisLinear regression analysis
Linear regression analysis
Nimrita Koul
 
Decision tree
Decision treeDecision tree
Decision tree
Ami_Surati
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
Umair Shafique
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
Student
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
Venkata Reddy Konasani
 
Maximum likelihood estimation
Maximum likelihood estimationMaximum likelihood estimation
Maximum likelihood estimation
zihad164
 
Logistic Regression Analysis
Logistic Regression AnalysisLogistic Regression Analysis
Logistic Regression Analysis
COSTARCH Analytical Consulting (P) Ltd.
 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
James Neill
 
Performance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialPerformance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorial
Bilkent University
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data science
Brad Klingenberg
 
Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data Visualization
Douglas Joubert
 
Ordinal Logistic Regression
Ordinal Logistic RegressionOrdinal Logistic Regression
Ordinal Logistic Regression
Al-Ahmadgaid Asaad
 
Point Estimation
Point Estimation Point Estimation
Inferential statistics
Inferential statisticsInferential statistics
Inferential statistics
Maria Theresa
 

What's hot (20)

Missing data handling
Missing data handlingMissing data handling
Missing data handling
 
Module 3: Linear Regression
Module 3:  Linear RegressionModule 3:  Linear Regression
Module 3: Linear Regression
 
Scatter plots
Scatter plotsScatter plots
Scatter plots
 
Polynomial regression
Polynomial regressionPolynomial regression
Polynomial regression
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Decision tree
Decision treeDecision tree
Decision tree
 
Linear regression analysis
Linear regression analysisLinear regression analysis
Linear regression analysis
 
Decision tree
Decision treeDecision tree
Decision tree
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Maximum likelihood estimation
Maximum likelihood estimationMaximum likelihood estimation
Maximum likelihood estimation
 
Logistic Regression Analysis
Logistic Regression AnalysisLogistic Regression Analysis
Logistic Regression Analysis
 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
 
Performance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialPerformance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorial
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data science
 
Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data Visualization
 
Ordinal Logistic Regression
Ordinal Logistic RegressionOrdinal Logistic Regression
Ordinal Logistic Regression
 
Point Estimation
Point Estimation Point Estimation
Point Estimation
 
Inferential statistics
Inferential statisticsInferential statistics
Inferential statistics
 

Similar to Linear and Logistics Regression

Simple lin regress_inference
Simple lin regress_inferenceSimple lin regress_inference
Simple lin regress_inference
Kemal İnciroğlu
 
Unit-III Correlation and Regression.pptx
Unit-III Correlation and Regression.pptxUnit-III Correlation and Regression.pptx
Unit-III Correlation and Regression.pptx
Anusuya123
 
Regression analysis algorithm
Regression analysis algorithm Regression analysis algorithm
Regression analysis algorithm
Sammer Qader
 
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docxFSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
budbarber38650
 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression ppt
Santosh Bhaskar
 
LINEAR REGRESSION.pptx
LINEAR REGRESSION.pptxLINEAR REGRESSION.pptx
LINEAR REGRESSION.pptx
neelamsanjeevkumar
 
Statistics For Management 3 October
Statistics For Management 3 OctoberStatistics For Management 3 October
Statistics For Management 3 October
Dr. Trilok Kumar Jain
 
correlation.pptx
correlation.pptxcorrelation.pptx
correlation.pptx
SmHasiv
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
Srikant001p
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
nszakir
 
Stat 1163 -correlation and regression
Stat 1163 -correlation and regressionStat 1163 -correlation and regression
Stat 1163 -correlation and regression
Khulna University
 
Research method ch09 statistical methods 3 estimation np
Research method ch09 statistical methods 3 estimation npResearch method ch09 statistical methods 3 estimation np
Research method ch09 statistical methods 3 estimation np
naranbatn
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
guest3720ca
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
Rose Jenkins
 
Simple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptxSimple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptx
SoumyaBansal7
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Derek Kane
 
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
Smarten Augmented Analytics
 
how to select the appropriate method for our study of interest
how to select the appropriate method for our study of interest how to select the appropriate method for our study of interest
how to select the appropriate method for our study of interest
NurFathihaTahiatSeeu
 
Correlation and Regression
Correlation and Regression Correlation and Regression
Correlation and Regression
Dr. Tushar J Bhatt
 
Data Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVAData Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVA
Derek Kane
 

Similar to Linear and Logistics Regression (20)

Simple lin regress_inference
Simple lin regress_inferenceSimple lin regress_inference
Simple lin regress_inference
 
Unit-III Correlation and Regression.pptx
Unit-III Correlation and Regression.pptxUnit-III Correlation and Regression.pptx
Unit-III Correlation and Regression.pptx
 
Regression analysis algorithm
Regression analysis algorithm Regression analysis algorithm
Regression analysis algorithm
 
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docxFSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression ppt
 
LINEAR REGRESSION.pptx
LINEAR REGRESSION.pptxLINEAR REGRESSION.pptx
LINEAR REGRESSION.pptx
 
Statistics For Management 3 October
Statistics For Management 3 OctoberStatistics For Management 3 October
Statistics For Management 3 October
 
correlation.pptx
correlation.pptxcorrelation.pptx
correlation.pptx
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
 
Stat 1163 -correlation and regression
Stat 1163 -correlation and regressionStat 1163 -correlation and regression
Stat 1163 -correlation and regression
 
Research method ch09 statistical methods 3 estimation np
Research method ch09 statistical methods 3 estimation npResearch method ch09 statistical methods 3 estimation np
Research method ch09 statistical methods 3 estimation np
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
 
Simple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptxSimple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptx
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
 
how to select the appropriate method for our study of interest
how to select the appropriate method for our study of interest how to select the appropriate method for our study of interest
how to select the appropriate method for our study of interest
 
Correlation and Regression
Correlation and Regression Correlation and Regression
Correlation and Regression
 
Data Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVAData Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVA
 

Recently uploaded

Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 

Recently uploaded (20)

Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 

Linear and Logistics Regression

  • 2. Agenda for Today’s Session WHAT IS REGRESSION We will understand the Regression and will move onto the types. REGRESSION DIAGNOSTICS Lets find what best works with Linear Regression 2 LINEAR REGRESSION vs LOGISITIC Lets find out the difference between Linear & Logistic Regression. UNDERSTANDING LINEAR REGRESSION ALGORITHM ASSUMPTIONS Lets Look into the Assumptions & Violations UNDERSTANDING LOGISTIC REGRESSION Lets Look into the Algorithm & understand how it functions. INTERVIEW QUESTIONS Covering some interview questions to strengthen the knowledge and give an idea about Interviews too.
  • 5. Understanding Linear Regression Algorithm 5 Establishes a relationship between the Independent & Dependent Variables. Examples of Independent & Dependent Variables:- • x is Rainfall and y is Crop Yield • x is Advertising Expense and y is Sales • x is sales of goods and y is GDP Here x is Independent Variable & Y is Dependent Variable Intro How it Works • Regression analysis is used to understand which among the Independent Variables are related to Dependent Variables. • It attempts to model relationship between two variables by fitting a line called Linear Regression Line. • The case of Single variable is called Simple Linear Regression where as the case of Multiple Independent Variables, it is called Multiple Linear Regression
  • 6. 6 Single Linear Regression Vs Multiple Linear Regression The Linear Regression line is created using Ordinary Least Square Method. X Y Simple Linear Regression Multiple Linear Regression X1 Y X2 X3 X4 Multiple Predictors
  • 7. Linear Regression Equation 7 y = mx + c Slope/Gradient Y Intercept
  • 8. Sum of Squared Error What is Error? • Actual Value – Predicted Value is called Error • Here Predicted Value is the value predicted by the Linear Regression Model. • Also known as Residual. Why it is Important? Smaller the residuals, more accurate model it would be.
  • 9. Regression Line | Best Fit Line What is the Line of Best Fit? • The Best Fit Line is the line that gives the minimum SSE. • Amongst all the possible lines, there will be one line that will be the best fit meaning greatest possible accuracy as a model. • The line that minimizes the sum of squared error of residuals is called Regression Line or the Best Fit Line. • In Simple Terms, it represents a straight line that best represents the data on scatterplot. It is drawn using the Least Square Method. Use the Least Square Method to Determine the Eqn. of Line of Best Fit x 8 2 11 6 5 4 12 9 6 1 y 3 10 3 6 8 12 1 4 9 14
  • 10. Finding Best Fit Line Algorithm How to find the Best Fit Line? • The equation of the straight line is given by y = mx +c m – slope of the line c - Intercept (The point at which the straight line touches y axis. • The Best Fit line is found basis the Least Squared Method. Algorithm: Step1: Find the Mean of x-values and y-values Step2: Calculate the slope of the line. It can be found using the following eqn. on the right. Step3: Compute the y-intercept of the line using the formula Mean of x and y values Finding m (Slope)
  • 11. Regression Line | Best Fit Line Use the Least Square Method to Determine the Eqn. of Line of Best Fit Steps: Following Steps are deployed to achieve the objective. Step1: Calculate the Mean of X and Ys Step2: Find the Following:-
  • 12. Regression Line | Best Fit Line Step1: Calculate the Mean of X and Ys The mean value of x is 6.4 and y is 7 Step2: Find the m (slope) m (slope) = -131/118.4 = -1.1 approx. Step3: Calculate the y intercept b (intercept) = 7 – (-1.1) * 6.4 = 14.0 approx. Thus, the equation of the line is y = -1.1x + 14
  • 13. Regression Equation and Line of Best Fit
  • 14. R Squared 1. R Squared is a statistical measure that represents the proportion of the variance for a DV explained by IV. 2. While correlation defines the strength, R Squared explains up to what extent the variance of one variable explains the variance of another var. 3. Example – In Investing, the R Squared is %age of a fund movement that can be explained by the movement of benchmark(sensex) 4. Aka Coefficient of Determination.
  • 16. How to see if the Assumptions are Violated – Deciding if Linear Model is a good Fit 16 Residual vs Fitted Values Plot 1. The x-axis has the fitted values and y axis has the Residuals. 2. Residual = Observed y value – Predicted y value. 3. Vertical distance of actual point vs line of the best fit is called Residual. 4. If unsure about the shape (curve) for regression equation by looking into the scatterplot, a residual plot helps in making decision. When a pattern is observed in a residual plot, a linear regression model is probably not appropriate for your data. Data should be randomly scattered around line 0
  • 17. Normal Q-Q Plot (Quantile Quantile Plot) 17 1. If the data is normally distributed, the points in the QQ-normal plot lie on a straight diagonal line. 2. Greater the departure from this reference line, the greater the evidence that the data is not following the normal distribution pattern.
  • 18. Interview Questions 18 01. Which of the following is true about Residuals? A) Lower is Better B) Higher is Better C) A or B depends on the situation. D) None of these
  • 19. Interview Questions - Solution 19 01. Which of the following is true about Residuals? A) Lower is Better B) Higher is Better C) A or B depends on the situation. D) None of these Solution A: Residuals refer to the error value of the model. Hence, Lower is Better. ( Residual = y – yhat)
  • 20. Interview Questions 20 02. Which of the statement is true regarding residuals in regression A) Mean of the Residuals is always zero. B) Mean of the Residuals is always less than zero. C) Mean of the Residuals is always more than zero. D) There is no such rule for residuals.
  • 21. Interview Questions - Solution 21 02. Which of the statement is true regarding residuals in regression A) Mean of the Residuals is always zero. B) Mean of the Residuals is always less than zero. C) Mean of the Residuals is always more than zero. D) There is no such rule for residuals. Solution: A Sum of residual in regression is always zero. It the sum of residuals is zero, the ‘Mean’ will also be zero.
  • 22. Interview Questions 22 03. To Test linear relationship of y (dependent) and x (independent) continuous variable, which of the following plots are best suited. A) Scatterplot B) Barplot C) Histograms D) None of These.
  • 23. Interview Questions - Solution 23 03. To Test linear relationship of y (dependent) and x (independent) continuous variable, which of the following plots are best suited. A) Scatterplot B) Barplot C) Histograms D) None of These. Solution: A To test the linear relationship between continuous variables Scatter plot is a good option. We can find out how one variable is changing w.r.t. another variable. A scatter plot displays the relationship between two quantitative variables.
  • 24. Interview Questions 24 04. A Correlation between the age and health of a person found to be -1.09. On the basis of this you would tell the doctors that: A) The age is good predictor of health B) The age is poor predictor of health C) None of These.
  • 25. Interview Questions - Solution 25 04. A Correlation between the age and health of a person found to be -1.09. On the basis of this you would tell the doctors that: A) The age is good predictor of health B) The age is poor predictor of health C) None of These. Solution: C Correlation coefficient range is between [-1 ,1]. So -1.09 is not possible.
  • 26. Interview Questions 26 05. Which of the following offsets, do we use in case of least square line fit? Suppose horizontal axis is independent variable and vertical axis is dependent variable. A) Vertical Offset B) Perpendicular Offset C) Both D) None of the Above.
  • 27. Interview Questions - Solution 27 05. Which of the following offsets, do we use in case of least square line fit? Suppose horizontal axis is independent variable and vertical axis is dependent variable. A) Vertical Offset B) Perpendicular Offset C) Both D) None of the Above. Solution: A We always consider residual as vertical offsets. Perpendicular offset are useful in case of PCA.
  • 28. 28 Linear Regression – Model Assumptions Since Linear Regression assesses whether one or more predictor variables explain the dependent Variable and hence it has 05 assumptions: 1. Linear Relationship 2. Normality 3. No or Little Multicollinearity 4. No Auto Correlation in errors 5. Homoscedasticity Note on sample size – The sample size thumb rule is that regression analysis requires at least 20 data points per independent variable in analysis.
  • 29. 29 1. Check for Linearity 1. Linear Regression needs the relationship between the independent & dependent variable to linear & additive. 2. Being additive means the effect of x on y is independent of other variables. 3. The linearity can be checked using the scatter plots. 4. Some examples are shown on right. It shows little to no correlation
  • 30. Transforming Variables to Achieve Linearity 30 • Each Row shows a different transformation method. • Transform column shows the method of transformation to be applied on DV or IV. • Regression equation is the equation used in analysis. • Last Column shows the equation of Prediction.
  • 31. Non Linear to Linear Conversion 31 The best transformation depends of the data & the best model will give the highest coefficient of Determination. Steps Involved:- 1. Create Linear Regression Model. 2. Construct a residual plot 3. If the plot is random, don’t transform the data. 4. Compute the Coefficient of Determination (R2) 5. Choose a Transformation method as mentioned in table in previous slide. 6. Transform IV or DV or both. 7. Apply Regression 8. If the Transformed R2 is greater than the previous score, the transformation is a success.
  • 32. Transformation Example 32 Objectives: 1. Create Linear Regression Model in Excel or R. 2. Find the Linear Regression Equation 3. Make Predictions 4. Create a Residual Plot and Find if Linear Regression is an Absolute Fit or not. 5. If Not, Try Transformation 0 10 20 30 40 50 60 70 80 0 2 4 6 8 10 Scatterplot Y Vs X
  • 33. 33 2. Check for Normality 1. Linear Regression requires all the variables need to be normal. 2. The normality can be checked using the histogram or Q-Q Plots. 3. Test for Normality aka goodness of fit test is called Kolmogorov Smirnov Test or Shapiro Wilk Normality Test 4. If the Data is not normal a non linear transformation ( e.g. Log Transformation) can fix the issue. 5. Normality means that Y values are normally distributed to each X.
  • 34. 34 3. Check for Multicollinearity 1. It means that the predictors are correlated with each other. Presence of correlation in independent variables lead of Multicollinearity. 2. What happens if variables are correlated - it becomes difficult for the model to determine the true effect of Independent Variables on Dependent. 3. Measure of Multicollinearity is given by VIF (Variable Influence Factor) 1. VIF tells us if the predictors are correlated, how much variance of an estimated coefficient increases. If no factors are correlated, VIF will be 1. 2. If VIF is 1 - No Multicollinearity. 3. VIF>1, the predictors may be correlated. 4. VIF between 5 & 10 – Indicates high correlation. 5. VIF >10: Regression coefficients are poorly estimated due to Multicollinearity. Solution: 1. To drop the variable showing high Collinearity. The presence of C suggests that the information provided by this variable for the DV is redundant and is of no use. 2. Another Approach is to combined the collinear variables and create new predictor (for e.g. taking average).
• 37. 4. Heteroscedasticity 37 Image Source: Google Heteroscedasticity means that the data has different dispersion – in other terms, unequal scatter of the residuals. Why it is a problem: OLS regression assumes that all residuals are drawn from a population with constant variance (homoscedasticity). Under heteroscedasticity the coefficient estimates stay unbiased, but their standard errors are biased, so tests and confidence intervals become unreliable. Heteroscedasticity, also spelled heteroskedasticity, occurs more often in datasets that have a large range between the largest and smallest observed values. A classic example: if you model household consumption based on income, you'll find that the variability in consumption increases as income increases. Breusch-Pagan / Cook-Weisberg Test – used to detect heteroskedasticity; if p < 0.05, you reject the null hypothesis and infer that heteroskedasticity is present (see the sketch below).
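A hedged sketch of the Breusch-Pagan test via statsmodels, using simulated income/consumption data in the spirit of the classic example above:

```python
# Breusch-Pagan test on a deliberately heteroscedastic simulation.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
income = rng.uniform(20, 200, size=300)
# Consumption noise grows with income -> heteroscedastic by construction.
consumption = 5 + 0.6 * income + rng.normal(0, 0.05 * income)

X = sm.add_constant(income)
model = sm.OLS(consumption, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
if lm_pvalue < 0.05:
    print(f"Heteroskedasticity detected (p = {lm_pvalue:.4f})")
```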
• 38. Scale Location Plot 38 • This plot is also useful for detecting heteroskedasticity. Ideally, it shouldn't show any pattern. • The presence of a pattern indicates heteroskedasticity. • Don't forget to corroborate the findings of this plot with the funnel shape in the residuals vs. fitted values plot.
• 39. Leverage Plot 39 Leverage: an observation with an extreme value on a predictor variable is called a high-leverage point. Leverage measures how far an observation deviates from the mean of that predictor. High-leverage points can have a strong effect on the estimates of the regression coefficients.
• 40. Summary of Tests in Python for Linear Regression Assumptions 40
• Multicollinearity – Variance Inflation Factor (from statsmodels.stats.outliers_influence import variance_inflation_factor)
• Normality – Shapiro-Wilk Test (from scipy.stats import shapiro), Jarque-Bera Test
• Autocorrelation – Durbin-Watson Test, Breusch-Godfrey Test
• Heteroscedasticity – Goldfeld-Quandt Test, Breusch-Pagan Test (import statsmodels.stats.api as sts)
• Non-Linearity – Linear Rainbow Test (import statsmodels.stats.api as sts)
• 41. Autocorrelation of Residuals 41 Autocorrelation of errors means that the errors are correlated. The assumption is that the residuals of a Linear Regression model are not correlated. Test of assumption – the Durbin-Watson test. Package – statsmodels.stats.stattools.durbin_watson(resids, axis=0). Null hypothesis: there is no serial correlation. Statistic (always between 0 and 4): • The test statistic equals 2*(1-r), where r is the sample autocorrelation of the residuals. • Thus r == 0 (no serial correlation) gives a statistic of 2. • Values closer to 0 give more evidence for positive serial correlation; values closer to 4 indicate negative serial correlation. A usage sketch follows below.
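A minimal usage sketch; `resids` here are random stand-in residuals rather than output of a real model:

```python
# Durbin-Watson statistic on stand-in residuals.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
resids = rng.normal(size=50)  # hypothetical model residuals

dw = durbin_watson(resids)  # equals 2*(1 - r); ~2 means no serial correlation
print(f"Durbin-Watson statistic: {dw:.2f}")
# Much less than 2 -> positive autocorrelation; much more -> negative.
```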
• 42. Assessing Goodness of Fit - R2 42 After fitting the model, it becomes essential to understand how well the model fits the data. When does the model fit the data best? A model fits the data well if the differences between the actual values and the model's predicted values are small and unbiased. What is R-Squared (R2)? It is a statistical measure of how close the data is to the fitted regression line. The definition is straightforward: it is the percentage of the response variable variation that is explained by the linear model. Or: R-squared = Explained variation / Total variation. R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean; 100% indicates that the model explains all of it. In general, the higher the R-squared, the better the model fits your data (see the sketch below).
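A small sketch computing R-squared straight from the definition (explained over total variation); the actual/predicted values are hypothetical:

```python
# R-squared = 1 - (unexplained variation / total variation).
import numpy as np

y_actual = np.array([3.0, 5.1, 7.2, 8.8, 11.0])   # hypothetical observations
y_pred   = np.array([3.2, 4.9, 7.0, 9.1, 10.8])   # hypothetical model output

ss_res = np.sum((y_actual - y_pred) ** 2)           # unexplained variation
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")
```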
  • 43. Interview Questions 43 01. True-False: Linear Regression is a supervised machine learning algorithm. A) TRUE B) FALSE
• 44. Interview Questions - Solution 44 True-False: Linear Regression is a supervised machine learning algorithm. A) TRUE B) FALSE Solution A: Yes, Linear Regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm has an input variable (x) and an output variable (Y) for each example.
  • 45. Interview Questions 45 02. Which of the following methods do we use to find the best fit line for data in Linear Regression? A) Least Square Method B) Maximum Likelihood C) Both A and B
  • 46. Interview Questions - Solution 46 02. Which of the following methods do we use to find the best fit line for data in Linear Regression? A) Least Square Method B) Maximum Likelihood C) Both A and B Solution - A: In Linear Regression, we use the Least Square Method to identify the Best Fit Line.
  • 47. Interview Questions 47 03. Which of the following evaluation metrics can be used to evaluate a model while modelling a continuous output variable? A) AUC - ROC B) Accuracy C) Mean Squared Error
• 48. Interview Questions - Solution 48 03. Which of the following evaluation metrics can be used to evaluate a model while modelling a continuous output variable? A) AUC - ROC B) Accuracy C) Mean Squared Error Solution C: Linear Regression outputs continuous values, so we use the Mean Squared Error metric to evaluate model performance. The other options are used for classification problems.
  • 49. Interview Questions 49 05. Which of the following statements is true about outliers in Linear Regression A) Linear Regression is sensitive to outliers B) Linear Regression is not sensitive to outliers C) No Idea. D) None of these
• 50. Interview Questions - Solution 50 05. Which of the following statements is true about outliers in Linear Regression? A) Linear Regression is sensitive to outliers B) Linear Regression is not sensitive to outliers C) No Idea. D) None of these Solution A: The slope of the regression line will change if outliers are present in the data; therefore Linear Regression is sensitive to outliers.
• 51. Linear Regression Example 51 Objectives: 1. Create a Linear Regression model in Excel or R. 2. Find the Linear Regression equation. 3. Make predictions. 4. Create a residual plot and find whether Linear Regression is an absolute fit or not.
x: 1, 2, 3, 4, 5
y: 2, 1, 3.5, 3, 4.5
Fitted equation: y = 0.7x + 0.7 [Scatterplot: X & Y]
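The same fit can be reproduced quickly in Python; this sketch uses the slide's data and recovers the same equation:

```python
# Least-squares fit of the example data above.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 3.5, 3, 4.5])

slope, intercept = np.polyfit(x, y, deg=1)  # degree 1 = straight line
print(f"y = {slope:.1f}x + {intercept:.1f}")  # -> y = 0.7x + 0.7

residuals = y - (slope * x + intercept)
print("Residuals:", np.round(residuals, 2))  # plot these to judge the fit
```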
• 52. Assessing Model Fit 52 Residuals – the distance of an actual point from the line of prediction. Root Mean Squared Error (RMSE) – reported as "Residual Standard Error" in linear model output; it is interpreted as how far, on average, the residuals are from zero. Mean Absolute Error (MAE) – another metric to evaluate the model; for a single point with actual y = 10 and predicted y = 30, the absolute error is |30 - 10| = 20, and MAE is the mean of such absolute errors. MAE is more robust to outliers than RMSE. R Square – explains the percentage of variance captured by the model; it ranges between 0 and 1, and a higher value is better. Adjusted R Square – since R Square increases whenever new variables are introduced, regardless of whether they actually add information to the model, we look to Adjusted R Square, which only increases if the newly added variable is truly useful. A metrics sketch follows below.
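Here is a short metrics sketch with scikit-learn, reusing the previous slide's example data and fitted line:

```python
# RMSE, MAE and R2 for the example fit y = 0.7x + 0.7.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

x = np.array([1, 2, 3, 4, 5], dtype=float)
y_actual = np.array([2, 1, 3.5, 3, 4.5])
y_pred = 0.7 * x + 0.7

rmse = np.sqrt(mean_squared_error(y_actual, y_pred))
mae = mean_absolute_error(y_actual, y_pred)
r2 = r2_score(y_actual, y_pred)
print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}  R2: {r2:.3f}")
```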
• 53. Difference Between Linear and Logistic Regression 53
• Linear: the data is modelled using a straight line. Logistic: a statistical model that predicts the probability of an outcome that can take two values.
• Linear: the outcome (dependent variable) is continuous in nature. Logistic: the outcome has only a limited number of possible values.
• Linear: the output variable is continuous. Logistic: the output variable is discrete.
• Linear: used to solve regression problems. Logistic: used to solve (binary) classification problems.
• Linear: estimates the dependent variable when the independent variable changes. Logistic: calculates the probability of occurrence of an event.
• Linear: uses the ordinary least squares method to minimise the errors and arrive at the best possible fit. Logistic: uses the maximum likelihood method to arrive at the solution.
• Linear: uses a straight line. Logistic: uses an S curve (sigmoid function).
• Examples – Linear: predicting sales, house prices, GDP, etc. Logistic: predicting whether an email is spam, a credit card transaction is fraudulent, or a customer will buy the product.
• 55. Logistic Regression 55 • Logistic Regression is used when the dependent variable is categorical. • The predicted values lie strictly in the range of 0 to 1. • It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
• 56. Logistic Regression Equation 56 Fundamental equation of the generalized linear model: g(E(y)) = α + βx1 + γx2 • Here g() is the link function; its role is to 'link' the expectation of y to the linear predictor. • E(y) is the expectation of the target variable. • α + βx1 + γx2 is the linear predictor (α, β, γ are to be estimated). Let's take a simple linear regression equation with the dependent variable enclosed in a link function: g(y) = βo + β(Age). Here g() is trying to establish the probability of success (p) versus the probability of failure (1-p). Criteria for p: • it must always be positive (p >= 0) • it must always be less than or equal to 1 (p <= 1). Since probability must always be positive, we put the linear predictor in exponential form: p = exp(βo + β(Age)). For any value of the slope and the independent variable, the exponential is never negative.
• 57. Logistic Regression Equation 57 Starting from p = exp(βo + β(Age)): to make the probability less than 1, we must divide p by a number greater than itself: p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1). Writing y = βo + β(Age), this gives p = e^y / (1 + e^y), which is the logistic (sigmoid) function. The probability of success is p = e^y / (1 + e^y) and the probability of failure is q = 1 - p = 1 / (1 + e^y). Dividing the two equations gives the odds: p/q = e^y, and taking the log gives log(p/(1-p)) = y.
• 58. Logistic Regression Equation 58 Final equation: log(p/(1-p)) = βo + β(Age). log(p/(1-p)) is the link function (the logit), and p/(1-p) is the odds. If the log-odds is positive, the probability of success is more than 50%.
• 59. Sigmoid Function 59 • The sigmoid function, also called the logistic function, has an S shape; it takes any real value and maps it to a value between 0 and 1. • If the output of the sigmoid function is more than 0.5, we classify the outcome as 1 or Yes. • If the output is less than 0.5, we classify the outcome as 0 or No. A minimal implementation is sketched below.
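A minimal sketch of the sigmoid and the 0.5 decision rule; the linear-predictor values are hypothetical:

```python
# Sigmoid (logistic) function and the 0.5 classification threshold.
import numpy as np

def sigmoid(z):
    """Map any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.2, 0.0, 0.4, 2.5])  # hypothetical linear predictors
probs = sigmoid(z)
labels = (probs > 0.5).astype(int)          # 1/Yes above 0.5, else 0/No
print(np.round(probs, 3), labels)
```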
• 60. ROC Curve 60 • ROC is a probability curve, and AUC represents the degree or measure of separability. • The higher the AUC, the better the model. • TPR is the True Positive Rate, aka Recall or Sensitivity, given by TP / (TP + FN). • Specificity = TN / (TN + FP), and FPR (False Positive Rate) = 1 – Specificity. A short sketch follows below.
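A short AUC sketch with scikit-learn; the labels and predicted probabilities are made up for illustration:

```python
# ROC curve points and AUC from hypothetical predictions.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # FPR = 1 - specificity
print(f"AUC: {roc_auc_score(y_true, y_prob):.3f}")  # closer to 1 is better
```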
• 61. What is a Confusion Matrix Image Source: Google A confusion matrix tabulates actual values (Y) against predicted values (Y hat). Let's say we are predicting the presence of a disease: "yes" means the patient has the disease and "no" means they don't. 1. The classifier made 165 predictions in total. 2. Out of those 165 cases, it predicted "yes" 110 times and "no" 55 times. 3. In reality, 105 patients in the sample have the disease and 60 do not. The basic terms to understand: • True Positives • True Negatives • False Positives • False Negatives (illustrated in the sketch below).
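As a small illustration of the four terms, here is a sketch with a few hypothetical disease-prediction labels (not the 165-case table above):

```python
# Extracting TP/TN/FP/FN from a confusion matrix (hypothetical labels).
from sklearn.metrics import confusion_matrix

y_actual = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
y_pred   = ["yes", "no",  "no", "yes", "yes", "no", "yes", "no"]

# With labels=["no", "yes"], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred, labels=["no", "yes"]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```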
• 62. Performance of Logistic Regression Model 62 1. AIC (Akaike Information Criterion) – the analogue of adjusted R² in logistic regression. AIC is a measure of fit that penalizes the model for the number of coefficients, so we always prefer the model with the minimum AIC value (a sketch follows below). 2. Confusion Matrix – the tabular representation of actual vs. predicted values; it helps us find the accuracy of the model: Accuracy = (TP + TN) / (TP + TN + FP + FN). 3. ROC Curve – the Receiver Operating Characteristic curve evaluates the trade-off between the true positive rate and the false positive rate. A common default is a threshold of p > 0.5 for predicting success. The higher the area under the curve, the better the predictive power of the model.
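A hedged sketch of reading AIC off a fitted logistic model in statsmodels; the tiny binary dataset is invented for illustration:

```python
# AIC of a logistic regression fit (hypothetical data).
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # binary outcome

X = sm.add_constant(x)
result = sm.Logit(y, X).fit(disp=0)
print(f"AIC: {result.aic:.2f}")  # prefer the candidate model with lower AIC
```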
• 63. Logistic Regression Assumptions 63 • Logistic Regression does not need a linear relationship between the dependent and independent variables. • The errors (residuals) need not be normally distributed. • There should be little to no multicollinearity among the independent variables. • The outcome is a binary variable: yes or no, 1 or 0, positive or negative, etc. • For a binary regression, factor level 1 of the dependent variable should represent the desired outcome. • There is a linear relationship between the logit of the outcome and each predictor variable. Recall that the logit function is logit(p) = log(p/(1-p)), where p is the probability of the outcome. • Logistic Regression requires quite large sample sizes.
• 65. Forward Selection 65 1. It's a process which begins with an empty model and keeps adding variables one by one. 2. It begins with the intercept; tests are performed to find the most relevant, 'best' variables. 3. The best variable is the one that returns the highest coefficient of determination (R-Squared value). 4. This keeps going, and once adding more variables no longer improves the model's accuracy, the process stops. 5. Several criteria are used to decide which variable goes in – lowest RMSE on cross-validation, F-test score, or lowest p-value. Tip: while using Forward Selection, to test the accuracy of the model it is better to run the trained classifier against held-out test data to make predictions.
• 66. Backward Selection 66 1. It's a process which begins with all variables and keeps removing predictors one by one. 2. At each step, the variable with the largest p-value – i.e. the least significant variable – is removed. 3. The resulting (p-1)-variable model, with the largest p-value removed, is a better model. 4. This keeps going, and once all remaining variables meet the defined significance level, the process stops. 5. Several criteria are used to decide which variable goes out – lowest RMSE on cross-validation, F-test score, or lowest p-value. A selection sketch follows below.
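Scikit-learn's SequentialFeatureSelector implements both directions, though it scores candidates by cross-validated model score rather than p-values; here is a hedged sketch on synthetic data:

```python
# Forward and backward feature selection on synthetic regression data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       random_state=0)

for direction in ("forward", "backward"):
    sfs = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=3,
                                    direction=direction, cv=5)
    sfs.fit(X, y)
    print(direction, "selected columns:", np.flatnonzero(sfs.get_support()))
```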
• 67. Let's review some concepts Linear Regression Assumptions of Linear Regression Difference Between Linear and Logistic Regression Logistic Regression Diagnostic Plots Forward and Backward Elimination 67
  • 68. Thanks! Any questions? You can find me at Linkedin @mkschauhan mukul.mschauhan@gmail.com 68 https://www.linkedin.com/pulse/data-visualisation-using-seaborn-mukul-kr-singh-chauhan https://www.linkedin.com/pulse/introduction-ggplot-series-mukul-kr-singh-chauhan/ https://www.linkedin.com/pulse/what-data-science-mukul-chauhan/