Regression
Agenda for Today's Session
WHAT IS REGRESSION
We will understand regression and then move on to its types.
REGRESSION DIAGNOSTICS
Let's find what works best with Linear Regression.
LINEAR REGRESSION vs LOGISTIC REGRESSION
Let's find out the difference between Linear & Logistic Regression.
UNDERSTANDING LINEAR REGRESSION ALGORITHM
ASSUMPTIONS
Let's look into the assumptions & violations.
UNDERSTANDING LOGISTIC REGRESSION
Let's look into the algorithm & understand how it functions.
INTERVIEW QUESTIONS
Covering some interview questions to strengthen the knowledge and give an idea about interviews too.
Let's Dive In
What is Regression
Understanding Linear Regression Algorithm
Intro
Establishes a relationship between the independent & dependent variables.
Examples of independent & dependent variables:
• x is Rainfall and y is Crop Yield
• x is Advertising Expense and y is Sales
• x is Sales of Goods and y is GDP
Here x is the independent variable & y is the dependent variable.
How it Works
• Regression analysis is used to understand which of the independent variables are related to the dependent variable.
• It attempts to model the relationship between variables by fitting a straight line called the Linear Regression Line.
• The case of a single independent variable is called Simple Linear Regression, whereas the case of multiple independent variables is called Multiple Linear Regression.
Simple Linear Regression vs Multiple Linear Regression
The linear regression line is created using the Ordinary Least Squares method.
[Diagram: Simple Linear Regression relates a single predictor X to Y; Multiple Linear Regression relates multiple predictors X1, X2, X3, X4 to Y.]
Linear Regression Equation
y = mx + c, where m is the slope (gradient) and c is the y-intercept.
Sum of Squared Error
What is Error?
• Actual Value − Predicted Value is called the Error.
• Here the Predicted Value is the value predicted by the Linear Regression Model.
• Also known as the Residual.
Why is it Important?
The smaller the residuals, the more accurate the model.
Regression Line | Best Fit Line
What is the Line of Best Fit?
• The Best Fit Line is the line that gives the minimum SSE (sum of squared errors).
• Amongst all possible lines, there will be one line that fits best, meaning the greatest possible accuracy as a model.
• The line that minimizes the sum of squared residuals is called the Regression Line or the Best Fit Line.
• In simple terms, it is the straight line that best represents the data on a scatterplot. It is drawn using the Least Squares method.
Use the Least Squares Method to Determine the Equation of the Line of Best Fit
x: 8, 2, 11, 6, 5, 4, 12, 9, 6, 1
y: 3, 10, 3, 6, 8, 12, 1, 4, 9, 14
Finding the Best Fit Line – Algorithm
How do we find the Best Fit Line?
• The equation of the straight line is given by y = mx + c, where
m – slope of the line
c – intercept (the point at which the line crosses the y-axis)
• The Best Fit Line is found using the Least Squares method.
Algorithm:
Step 1: Find the mean of the x-values (x̄) and the y-values (ȳ).
Step 2: Calculate the slope of the line: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
Step 3: Compute the y-intercept of the line using c = ȳ − m·x̄
Regression Line | Best Fit Line
Use the Least Squares Method to Determine the Equation of the Line of Best Fit
Step 1: Calculate the means of the x and y values. The mean of x is 6.4 and the mean of y is 7.
Step 2: Find m (slope): m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = −131 / 118.4 ≈ −1.1
Step 3: Calculate the y-intercept: c = 7 − (−1.1) × 6.4 ≈ 14.0
Thus, the equation of the line is y = -1.1x + 14
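A minimal sketch (assuming numpy, which the deck does not use) that reproduces the worked example above:

```python
import numpy as np

x = np.array([8, 2, 11, 6, 5, 4, 12, 9, 6, 1], dtype=float)
y = np.array([3, 10, 3, 6, 8, 12, 1, 4, 9, 14], dtype=float)

# Step 1: means
x_bar, y_bar = x.mean(), y.mean()   # 6.4 and 7.0

# Step 2: slope m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
m = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()  # -131 / 118.4

# Step 3: intercept c = y_bar - m * x_bar
c = y_bar - m * x_bar

print(f"y = {m:.1f}x + {c:.1f}")    # y = -1.1x + 14.1
```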
Regression Equation and Line of Best Fit
R Squared
1. R-Squared is a statistical measure that represents the proportion of the variance of the dependent variable (DV) that is explained by the independent variable(s) (IV).
2. While correlation defines the strength of the relationship, R-Squared explains to what extent the variance of one variable explains the variance of another.
3. Example – in investing, R-Squared is the percentage of a fund's movements that can be explained by movements of the benchmark (e.g., the Sensex).
4. Also known as the Coefficient of Determination.
How to See if the Assumptions are Violated – Deciding if a Linear Model is a Good Fit
Residual vs Fitted Values Plot
1. The x-axis has the fitted values and the y-axis has the residuals.
2. Residual = observed y value − predicted y value.
3. The vertical distance of an actual point from the line of best fit is called the residual.
4. If you are unsure about the shape (curve) of the regression equation from the scatterplot, a residual plot helps in making the decision.
When a pattern is observed in a residual plot, a linear regression model is probably not appropriate for your data. The data should be randomly scattered around the 0 line.
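A minimal sketch of a residuals-vs-fitted plot, assuming statsmodels and matplotlib and an illustrative toy dataset (none of these appear on this slide):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 5 + rng.normal(0, 2, 100)        # linear data with noise

model = sm.OLS(y, sm.add_constant(x)).fit()  # ordinary least squares fit

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red", linestyle="--")  # residuals should scatter randomly around this line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```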
Normal Q-Q Plot (Quantile-Quantile Plot)
1. If the data is normally distributed, the points in the normal Q-Q plot lie on a straight diagonal line.
2. The greater the departure from this reference line, the greater the evidence that the data does not follow a normal distribution.
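A Q-Q plot sketch using scipy's probplot; the stand-in "residuals" array is assumed for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 200)   # stand-in for a model's residuals

# Plots sample quantiles against theoretical normal quantiles
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()
```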
Interview Questions
01. Which of the following is true about Residuals?
A) Lower is Better
B) Higher is Better
C) A or B depends on the situation.
D) None of these
Interview Questions - Solution
01. Which of the following is true about Residuals?
A) Lower is Better
B) Higher is Better
C) A or B depends on the situation.
D) None of these
Solution: A. Residuals refer to the error value of the model; hence, lower is better. (Residual = y − ŷ)
Interview Questions
02. Which of the following statements is true regarding residuals in regression?
A) Mean of the Residuals is always zero.
B) Mean of the Residuals is always less than zero.
C) Mean of the Residuals is always more than zero.
D) There is no such rule for residuals.
Interview Questions - Solution
02. Which of the following statements is true regarding residuals in regression?
A) Mean of the Residuals is always zero.
B) Mean of the Residuals is always less than zero.
C) Mean of the Residuals is always more than zero.
D) There is no such rule for residuals.
Solution: A
The sum of residuals in OLS regression (with an intercept) is always zero. If the sum of the residuals is zero, their mean will also be zero.
Interview Questions
03. To test the linear relationship between y (dependent) and x (independent) continuous variables, which of the following plots is best suited?
A) Scatterplot
B) Barplot
C) Histograms
D) None of These.
Interview Questions - Solution
03. To test the linear relationship between y (dependent) and x (independent) continuous variables, which of the following plots is best suited?
A) Scatterplot
B) Barplot
C) Histograms
D) None of These.
Solution: A
To test the linear relationship between continuous variables, a scatter plot is a good option. We can see how one variable changes with respect to another: a scatter plot displays the relationship between two quantitative variables.
Interview Questions
04. The correlation between the age and health of a person was found to be −1.09. On the basis of this, you would tell the doctors that:
A) Age is a good predictor of health
B) Age is a poor predictor of health
C) None of These.
Interview Questions - Solution
04. The correlation between the age and health of a person was found to be −1.09. On the basis of this, you would tell the doctors that:
A) Age is a good predictor of health
B) Age is a poor predictor of health
C) None of These.
Solution: C
The correlation coefficient ranges over [−1, 1], so −1.09 is not possible.
Interview Questions
05. Which of the following offsets do we use in the case of a least squares line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable.
A) Vertical Offset
B) Perpendicular Offset
C) Both
D) None of the Above.
Interview Questions - Solution
05. Which of the following offsets do we use in the case of a least squares line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable.
A) Vertical Offset
B) Perpendicular Offset
C) Both
D) None of the Above.
Solution: A
We always treat residuals as vertical offsets. Perpendicular offsets are used in PCA.
Linear Regression – Model Assumptions
Since linear regression assesses whether one or more predictor variables explain the dependent variable, it has five assumptions:
1. Linear Relationship
2. Normality
3. No or Little Multicollinearity
4. No Auto Correlation in errors
5. Homoscedasticity
Note on sample size – the rule of thumb is that regression analysis requires at least 20 data points per independent variable in the analysis.
1. Check for Linearity
1. Linear regression needs the relationship between the independent & dependent variables to be linear & additive.
2. Being additive means the effect of x on y is independent of the other variables.
3. Linearity can be checked using scatter plots.
4. Some examples are shown on the right; they show little to no correlation.
Transforming Variables to Achieve Linearity
• Each row shows a different transformation method.
• The Transform column shows the transformation to be applied to the DV or IV.
• The Regression Equation column shows the equation used in the analysis.
• The last column shows the prediction equation.
Non-Linear to Linear Conversion
The best transformation depends on the data, and the best model will give the highest coefficient of determination.
Steps involved (a code sketch follows the list):
1. Create a linear regression model.
2. Construct a residual plot.
3. If the plot is random, don't transform the data.
4. Compute the coefficient of determination (R²).
5. Choose a transformation method from the table on the previous slide.
6. Transform the IV or DV, or both.
7. Apply regression.
8. If the transformed R² is greater than the previous score, the transformation is a success.
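A sketch of the workflow on assumed exponential-looking toy data; numpy, the variable names, and the use of np.polyfit are illustrative assumptions, not from the deck:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 11, dtype=float)
y = 2.0 * np.exp(0.4 * x) + rng.normal(0, 0.5, x.size)   # non-linear data

def r_squared(y_true, y_pred):
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Steps 1 and 4: linear model on the raw data and its R^2
m1, c1 = np.polyfit(x, y, 1)
r2_raw = r_squared(y, m1 * x + c1)

# Steps 5-7: transform the DV (log y), regress, predict on the original scale
m2, c2 = np.polyfit(x, np.log(y), 1)
r2_log = r_squared(y, np.exp(m2 * x + c2))

# Step 8: compare
print(f"raw R^2 = {r2_raw:.3f}, transformed R^2 = {r2_log:.3f}")
```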
Transformation Example
Objectives:
1. Create a linear regression model in Excel or R.
2. Find the linear regression equation.
3. Make predictions.
4. Create a residual plot and find whether linear regression is an absolute fit or not.
5. If not, try a transformation.
[Scatterplot: Y vs X]
2. Check for Normality
1. Linear regression requires the variables to be normally distributed.
2. Normality can be checked using histograms or Q-Q plots.
3. Tests for normality (aka goodness-of-fit tests) include the Kolmogorov-Smirnov test and the Shapiro-Wilk normality test (see the sketch below).
4. If the data is not normal, a non-linear transformation (e.g., a log transformation) can fix the issue.
5. Normality means that the y values are normally distributed for each x.
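A minimal Shapiro-Wilk sketch using the scipy import shown later in the deck; the `data` array is an assumed stand-in for the variable (or residuals) being checked:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=100)   # stand-in for the variable being checked

# Null hypothesis: the sample comes from a normal distribution
stat, p_value = shapiro(data)
if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject normality")
else:
    print(f"p = {p_value:.3f}: no evidence against normality")
```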
3. Check for Multicollinearity
1. Multicollinearity means that the predictors are correlated with each other: the presence of correlation among the independent variables leads to multicollinearity.
2. What happens if variables are correlated? It becomes difficult for the model to determine the true effect of each independent variable on the dependent variable.
3. Multicollinearity is measured by the VIF (Variance Inflation Factor):
1. VIF tells us, if the predictors are correlated, how much the variance of an estimated coefficient increases. If no factors are correlated, the VIF will be 1.
2. VIF = 1 – no multicollinearity.
3. VIF > 1 – the predictors may be correlated.
4. VIF between 5 and 10 – indicates high correlation.
5. VIF > 10 – regression coefficients are poorly estimated due to multicollinearity.
Solution (a VIF sketch follows):
1. Drop the variable showing high collinearity. The presence of collinearity suggests that the information this variable provides about the DV is redundant.
2. Another approach is to combine the collinear variables into a new predictor (e.g., by taking their average).
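A minimal VIF sketch, matching the statsmodels import listed later in the deck; the DataFrame columns are assumed example predictors, with x3 made deliberately collinear with x1:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
X["x3"] = 0.9 * X["x1"] + rng.normal(scale=0.1, size=100)  # collinear with x1

Xc = sm.add_constant(X)   # VIF is usually computed with an intercept column
vifs = {col: variance_inflation_factor(Xc.values, i)
        for i, col in enumerate(Xc.columns) if col != "const"}
print(vifs)               # x1 and x3 should show high VIFs
```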
Plots Showing Heteroscedasticity
4. Heteroscedasticity
Image Source: Google
Heteroscedasticity (also spelled heteroskedasticity) means that the data has unequal dispersion, i.e. unequal scatter.
Why is it a Problem?
It is a problem because OLS regression assumes that all residuals are drawn from a population with constant variance (homoscedasticity). It distorts results: the coefficient estimates stay unbiased, but their standard errors do not, which invalidates the usual tests.
Heteroscedasticity occurs more often in datasets that have a large range between the largest and smallest observed values.
A classic example: if you model household consumption based on income, you'll find that the variability in consumption increases as income increases.
Breusch-Pagan / Cook-Weisberg Test – this test is used to determine the presence of heteroskedasticity. If you find p < 0.05, you reject the null hypothesis (constant variance) and infer that heteroskedasticity is present.
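A hedged sketch of the Breusch-Pagan test via statsmodels; the heteroscedastic toy data (error variance growing with x) is assumed for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(0, x)          # error variance grows with x

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Returns (LM statistic, LM p-value, F statistic, F p-value)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p = {lm_pvalue:.4f}")   # p < 0.05 => heteroskedasticity present
```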
Scale-Location Plot
• This plot is also useful for detecting heteroskedasticity. Ideally, it shouldn't show any pattern.
• The presence of a pattern indicates heteroskedasticity.
• Don't forget to corroborate the findings of this plot with the funnel shape in the residuals vs. fitted values plot.
Leverage Plot
Leverage: an observation with an extreme value on a predictor variable is called a high-leverage point. Leverage measures how far an observation deviates from the mean of that variable. High-leverage points can have a large effect on the estimates of the regression coefficients.
Summary of Tests in Python for Linear Regression Assumptions

Multicollinearity Test – Variance Inflation Factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

Normality Test – Shapiro-Wilk Test, Jarque-Bera Test
from scipy.stats import shapiro

Autocorrelation Test – Durbin-Watson Test, Breusch-Pagan Test

Heteroscedasticity Test – Goldfeld-Quandt Test, Breusch-Pagan Test
import statsmodels.stats.api as sts

Non-Linearity Test – Linear Rainbow Test
import statsmodels.stats.api as sts
Autocorrelation of Residuals
Autocorrelation of errors means that the errors are correlated.
The assumption is that the linear regression model's residuals are not correlated.
Test of the Assumption – Durbin-Watson Test
Package – statsmodels.stats.stattools.durbin_watson(resids, axis=0)
What is the Null Hypothesis?
The null hypothesis of the test is that there is no serial correlation.
Statistic (always between 0 and 4)
• The test statistic is approximately equal to 2(1 − r), where r is the sample autocorrelation of the residuals.
• Thus r = 0, indicating no serial correlation, gives a test statistic of 2.
• The closer to 0, the more evidence for positive serial correlation; the closer to 4, the more evidence for negative serial correlation.
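A minimal Durbin-Watson sketch using the statsmodels function named above; the AR(1) residuals are simulated with positive serial correlation for illustration:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
noise = rng.normal(size=200)
resids = np.empty(200)
resids[0] = noise[0]
for t in range(1, 200):
    resids[t] = 0.8 * resids[t - 1] + noise[t]   # AR(1): positively autocorrelated

dw = durbin_watson(resids)
print(f"DW = {dw:.2f}")   # well below 2 => positive serial correlation
```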
Assessing Goodness of Fit – R²
After fitting the model, it becomes essential to understand how well the model fits the data.
When Does the Model Fit the Data Best?
A model fits the data well if the differences between the actual values and the model's predicted values are small and unbiased.
What is R-Squared (R²)?
It is a statistical measure of how close the data is to the fitted regression line. The definition of R-squared is fairly straightforward: it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
• 0% indicates that the model explains none of the variability of the response data around its mean.
• 100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
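A sketch computing R-squared straight from the definition above; the actual and predicted values are illustrative:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2, 10.8])

ss_res = ((y_true - y_pred) ** 2).sum()          # unexplained variation
ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total variation
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.4f}")                  # 0.9945
```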
Interview Questions
01. True-False: Linear Regression is a supervised machine learning algorithm.
A) TRUE
B) FALSE
Interview Questions - Solution
01. True-False: Linear Regression is a supervised machine learning algorithm.
A) TRUE
B) FALSE
Solution: A. Yes, linear regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm should have an input variable (x) and an output variable (y) for each example.
Interview Questions
02. Which of the following methods do we use to find the best fit line for data in Linear
Regression?
A) Least Square Method
B) Maximum Likelihood
C) Both A and B
Interview Questions - Solution
02. Which of the following methods do we use to find the best fit line for data in Linear
Regression?
A) Least Square Method
B) Maximum Likelihood
C) Both A and B
Solution: A. In linear regression, we use the least squares method to identify the best fit line.
Interview Questions
03. Which of the following evaluation metrics can be used to evaluate a model while
modelling a continuous output variable?
A) AUC - ROC
B) Accuracy
C) Mean Squared Error
Interview Questions - Solution
03. Which of the following evaluation metrics can be used to evaluate a model while
modelling a continuous output variable?
A) AUC - ROC
B) Accuracy
C) Mean Squared Error
Solution: C. Since linear regression outputs continuous values, we use the Mean Squared Error metric to evaluate model performance. The other options are used for classification problems.
Interview Questions
05. Which of the following statements is true about outliers in Linear Regression?
A) Linear Regression is sensitive to outliers
B) Linear Regression is not sensitive to outliers
C) No Idea.
D) None of these
Interview Questions - Solution
05. Which of the following statements is true about outliers in Linear Regression?
A) Linear Regression is sensitive to outliers
B) Linear Regression is not sensitive to outliers
C) No Idea.
D) None of these
Solution: A. The slope of the regression line will change if outliers are present in the data; therefore, linear regression is sensitive to outliers.
Linear Regression Example
Objectives:
1. Create Linear Regression Model in Excel or R.
2. Find the Linear Regression Equation
3. Make Predictions
4. Create a Residual Plot and Find if Linear Regression is an Absolute Fit or not

x: 1, 2, 3, 4, 5
y: 2, 1, 3.5, 3, 4.5

Fitted line: y = 0.7x + 0.7
[Scatterplot: X & Y with the fitted line]
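A minimal check of the fitted line above, assuming numpy's polyfit as a stand-in for Excel/R:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 3.5, 3, 4.5])

m, c = np.polyfit(x, y, 1)         # least-squares fit of degree 1
print(f"y = {m:.1f}x + {c:.1f}")   # y = 0.7x + 0.7

residuals = y - (m * x + c)        # inputs for the residual plot in objective 4
print(residuals)                   # [ 0.6 -1.1  0.7 -0.5  0.3], sums to ~0
```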
Assessing Model Fit
Residuals
The distance of an actual point from the line of prediction.
Root Mean Squared Error (RMSE)
Reported as "Residual Standard Error" in linear model output. It is interpreted as how far, on average, the residuals are from zero.
Mean Absolute Error (MAE)
Mean Absolute Error is another metric to evaluate the model. For example, if the actual y is 10 and the predicted y is 30, the resulting absolute error is |30 − 10| = 20. MAE is robust against the effect of outliers.
R-Squared
This metric explains the percentage of variance explained by the model. It ranges between 0 and 1, and a higher value is generally better.
Adjusted R-Squared
Since R-Squared increases as new variables are introduced, regardless of whether the new variable actually adds information to the model, we look to Adjusted R-Squared, which is steadier and only increases or decreases if the newly added variable is truly useful.
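A sketch of the Adjusted R-Squared formula, adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors; the sample numbers are illustrative:

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    # Penalizes R^2 for the number of predictors p
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(r2=0.85, n=100, p=5))  # ~0.842, slightly below the raw R^2
```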
Difference Between Linear and Logistic Regression
• Linear: the data is modelled using a straight line. Logistic: a statistical model that predicts the probability of an outcome that can take two values.
• Linear: the outcome (dependent variable) is continuous in nature. Logistic: the outcome (dependent variable) has only a limited number of possible values.
• Linear: the output variable is continuous. Logistic: the output variable is discrete.
• Linear: used to solve regression problems. Logistic: used to solve classification problems (binary classification).
• Linear: estimates the dependent variable when there is a change in the independent variable. Logistic: calculates the probability of occurrence of an event.
• Linear: uses the ordinary least squares method to minimise the errors and arrive at the best possible fit. Logistic: uses the maximum likelihood method to arrive at the solution.
• Linear: uses a straight line. Logistic: uses an S curve (sigmoid function).
• Linear examples: predicting sales, house prices, GDP, etc. Logistic examples: predicting whether an email is spam, whether a credit card transaction is fraudulent, or whether a customer will buy a product.
Logistic Regression
• Logistic Regression is used when the dependent variable is categorical.
• The predicted values are strictly in the range of 0 to 1.
• It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Logistic Regression Equation
The fundamental equation of the generalized linear model is:
g(E(y)) = α + βx1 + γx2
• Here g() is the link function. The role of the link function is to 'link' the expectation of y to the linear predictor.
• E(y) is the expectation of the target variable.
• α + βx1 + γx2 is the linear predictor (α, β, γ are to be estimated).
Let's take a simple regression equation with the dependent variable enclosed in a link function:
g(y) = βo + β(Age)
Here the g() function is trying to establish the probability of success (p) or the probability of failure (1 − p).
Criteria for p:
• It must always be positive (p >= 0)
• It must always be less than or equal to 1 (p <= 1)
Since probability must always be positive, we put the linear equation in exponential form:
p = exp(βo + β(Age))
For any value of the slope and the independent variable, the exponential of this equation will never be negative.
Logistic Regression Equation
p = exp(βo + β(Age))
In order to make the probability less than 1, we divide p by a number greater than p:
p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1)
Since g(y) = βo + β(Age), writing y for the linear predictor gives:
p = e^y / (1 + e^y)
This is the logistic function.
The probability of success is p = e^y / (1 + e^y), and the probability of failure is
q = 1 − p = 1 / (1 + e^y)
Dividing the two equations gives p / q = e^y, or
log(p / (1 − p)) = y
Final Equation
log(p / (1 − p)) = βo + β(Age)
log(p / (1 − p)) is the link function (the logit) and p / (1 − p) is the odds. If the log of the odds is positive, the probability of success is more than 50%.
Sigmoid Function
• The sigmoid function, also called the logistic function, gives an S-shaped curve that can take any real value and map it into a value between 0 and 1.
• If the output of the sigmoid function is more than 0.5, we classify the outcome as 1 or Yes.
• If the output of the sigmoid function is less than 0.5, we classify the outcome as 0 or No.
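A minimal sigmoid sketch with the 0.5 threshold described above; the sample z values are illustrative:

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
p = sigmoid(z)
labels = (p > 0.5).astype(int)    # 1/Yes if p > 0.5, else 0/No
print(p.round(3))                 # [0.047 0.378 0.5   0.622 0.953]
print(labels)                     # [0 0 0 1 1]
```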
ROC Curve
• ROC is a probability curve, and AUC represents the degree (or measure) of separability.
• The higher the AUC, the better the model.
• Here TPR is the True Positive Rate (aka Recall or Sensitivity), given by TP / (TP + FN).
• Specificity = TN / (TN + FP), and FPR (False Positive Rate) = 1 − Specificity.
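A hedged ROC/AUC sketch; scikit-learn and the toy labels/scores are assumptions (the deck does not name a library here):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.5])  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points on the ROC curve
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")   # closer to 1.0 means better separability
```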
What is a Confusion Matrix
Image Source: Google
[Confusion matrix image: rows are the actual values (y), columns are the predicted values (ŷ)]
Let's say we are predicting the presence of a disease: "yes" means they have the disease, and "no" means they don't.
1. The classifier made a total of 165 predictions.
2. Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
3. In reality, 105 patients in the sample have the disease and 60 patients do not.
Let's understand the basic terms:
• True Positives
• True Negatives
• False Positives
• False Negatives
Performance of a Logistic Regression Model
1. AIC (Akaike Information Criterion) – the metric analogous to adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients; therefore, we always prefer the model with the minimum AIC value.
2. Confusion Matrix: the tabular representation of actual vs predicted values. It helps us find the accuracy of the model, calculated as
Accuracy = (TP + TN) / (TP + TN + FP + FN)
3. ROC Curve:
• The ROC Curve (Receiver Operating Characteristic curve) evaluates the trade-off between the true positive rate and the false positive rate.
• It is advisable to use p > 0.5 as the threshold value, since we are more concerned about the success of the model.
• The higher the area under the curve, the better the predictive power of the model.
Logistic Regression Assumptions
• Logistic regression does not need a linear relationship between the dependent and independent variables.
• The errors (residuals) need not be normally distributed.
• There should be little to no multicollinearity amongst the independent variables.
• The outcome is a binary variable like yes or no, 1 or 0, positive or negative, etc.
• For a binary regression, factor level 1 of the dependent variable should represent the desired outcome.
• There is a linear relationship between the logit of the outcome and each predictor variable. Recall that the logit function is logit(p) = log(p/(1−p)), where p is the probability of the outcome.
• Logistic regression requires quite large sample sizes.
Forward & Backward Selection
Forward Selection
1. It's a process which begins with an empty model and keeps adding variables one by one.
2. It begins with just the intercept; tests are performed to find which variables are the 'best' to add.
3. The best variable is the one that returns the highest coefficient of determination (R-Squared value).
4. This process continues, and once adding more variables no longer improves the model's accuracy, the process stops.
5. Several criteria are used to determine which variable goes in – lowest RMSE on cross-validation, F-test score, or lowest p-value. A simplified sketch follows.
Tip: while using forward selection, in order to test the accuracy of the model, it's better to use the trained classifier against test data to make predictions.
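A simplified forward-selection sketch scored by R-Squared; statsmodels, the helper name, and the stopping rule (minimum R² gain) are illustrative assumptions:

```python
import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y, min_gain=0.01):
    selected, best_r2 = [], 0.0
    remaining = list(X.columns)
    while remaining:
        # Try adding each remaining variable and keep the best one
        scores = {col: sm.OLS(y, sm.add_constant(X[selected + [col]])).fit().rsquared
                  for col in remaining}
        best_col = max(scores, key=scores.get)
        if scores[best_col] - best_r2 < min_gain:   # no meaningful improvement: stop
            break
        selected.append(best_col)
        remaining.remove(best_col)
        best_r2 = scores[best_col]
    return selected
```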
Backward Selection
1. It's a process which begins with all variables and keeps removing predictors one by one.
2. Remove the variable with the largest p-value, i.e. the variable that is least significant.
3. The new (p−1)-variable model, with the largest p-value removed, is a better model.
4. This process continues; once every remaining variable meets the defined significance level, we may stop.
5. Several criteria are used to determine which variable goes out – lowest RMSE on cross-validation, F-test score, or largest p-value. A simplified sketch follows.
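A simplified backward-elimination sketch that drops the largest p-value until everything left is significant; statsmodels, the helper name, and the 0.05 threshold are illustrative assumptions:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y, alpha=0.05):
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvalues = model.pvalues.drop("const")   # one p-value per predictor
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:             # everything significant: stop
            break
        cols.remove(worst)                      # drop the least significant variable
    return cols
```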
Let's review some concepts:
• Linear Regression
• Assumptions of Linear Regression
• Diagnostic Plots
• Difference Between Linear and Logistic Regression
• Logistic Regression
• Forward and Backward Elimination
Thanks!
Any questions?
You can find me on LinkedIn
@mkschauhan
mukul.mschauhan@gmail.com
https://www.linkedin.com/pulse/data-visualisation-using-seaborn-mukul-kr-singh-chauhan
https://www.linkedin.com/pulse/introduction-ggplot-series-mukul-kr-singh-chauhan/
https://www.linkedin.com/pulse/what-data-science-mukul-chauhan/
More Related Content

Similar to regression-linearandlogisitics-220524024037-4221a176 (1).pdf

Regression analysis
Regression analysisRegression analysis
Regression analysisSrikant001p
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regressionnszakir
 
Stat 1163 -correlation and regression
Stat 1163 -correlation and regressionStat 1163 -correlation and regression
Stat 1163 -correlation and regressionKhulna University
 
Research method ch09 statistical methods 3 estimation np
Research method ch09 statistical methods 3 estimation npResearch method ch09 statistical methods 3 estimation np
Research method ch09 statistical methods 3 estimation npnaranbatn
 
Simple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptxSimple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptxSoumyaBansal7
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsDerek Kane
 
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?Smarten Augmented Analytics
 
how to select the appropriate method for our study of interest
how to select the appropriate method for our study of interest how to select the appropriate method for our study of interest
how to select the appropriate method for our study of interest NurFathihaTahiatSeeu
 
Data Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVAData Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVADerek Kane
 
Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help HelpWithAssignment.com
 
8 correlation regression
8 correlation regression 8 correlation regression
8 correlation regression Penny Jiang
 
Chapter 9 Regression
Chapter 9 RegressionChapter 9 Regression
Chapter 9 Regressionghalan
 
correlation and regression
correlation and regressioncorrelation and regression
correlation and regressionFaiezah Zulkifli
 
Introduction to correlation and regression analysis
Introduction to correlation and regression analysisIntroduction to correlation and regression analysis
Introduction to correlation and regression analysisFarzad Javidanrad
 
linearregression-1909240jhgg53948.pptx
linearregression-1909240jhgg53948.pptxlinearregression-1909240jhgg53948.pptx
linearregression-1909240jhgg53948.pptxbishalnandi2
 

Similar to regression-linearandlogisitics-220524024037-4221a176 (1).pdf (20)

Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
 
Stat 1163 -correlation and regression
Stat 1163 -correlation and regressionStat 1163 -correlation and regression
Stat 1163 -correlation and regression
 
Research method ch09 statistical methods 3 estimation np
Research method ch09 statistical methods 3 estimation npResearch method ch09 statistical methods 3 estimation np
Research method ch09 statistical methods 3 estimation np
 
Simple linear regression
Simple linear regressionSimple linear regression
Simple linear regression
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
 
Simple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptxSimple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptx
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
 
how to select the appropriate method for our study of interest
how to select the appropriate method for our study of interest how to select the appropriate method for our study of interest
how to select the appropriate method for our study of interest
 
Correlation and Regression
Correlation and Regression Correlation and Regression
Correlation and Regression
 
Data Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVAData Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVA
 
Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help
 
Regression ppt
Regression pptRegression ppt
Regression ppt
 
8 correlation regression
8 correlation regression 8 correlation regression
8 correlation regression
 
Chapter 9 Regression
Chapter 9 RegressionChapter 9 Regression
Chapter 9 Regression
 
correlation and regression
correlation and regressioncorrelation and regression
correlation and regression
 
Introduction to correlation and regression analysis
Introduction to correlation and regression analysisIntroduction to correlation and regression analysis
Introduction to correlation and regression analysis
 
linearregression-1909240jhgg53948.pptx
linearregression-1909240jhgg53948.pptxlinearregression-1909240jhgg53948.pptx
linearregression-1909240jhgg53948.pptx
 

Recently uploaded

Non Text Magic Studio Magic Design for Presentations L&P.pptx
Non Text Magic Studio Magic Design for Presentations L&P.pptxNon Text Magic Studio Magic Design for Presentations L&P.pptx
Non Text Magic Studio Magic Design for Presentations L&P.pptxAbhayThakur200703
 
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfIntro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfpollardmorgan
 
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...lizamodels9
 
M.C Lodges -- Guest House in Jhang.
M.C Lodges --  Guest House in Jhang.M.C Lodges --  Guest House in Jhang.
M.C Lodges -- Guest House in Jhang.Aaiza Hassan
 
The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024christinemoorman
 
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdfRenandantas16
 
BEST Call Girls In Greater Noida ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In Greater Noida ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,BEST Call Girls In Greater Noida ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In Greater Noida ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,noida100girls
 
Keppel Ltd. 1Q 2024 Business Update Presentation Slides
Keppel Ltd. 1Q 2024 Business Update  Presentation SlidesKeppel Ltd. 1Q 2024 Business Update  Presentation Slides
Keppel Ltd. 1Q 2024 Business Update Presentation SlidesKeppelCorporation
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageMatteo Carbone
 
Progress Report - Oracle Database Analyst Summit
Progress  Report - Oracle Database Analyst SummitProgress  Report - Oracle Database Analyst Summit
Progress Report - Oracle Database Analyst SummitHolger Mueller
 
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
Tech Startup Growth Hacking 101  - Basics on Growth MarketingTech Startup Growth Hacking 101  - Basics on Growth Marketing
Tech Startup Growth Hacking 101 - Basics on Growth MarketingShawn Pang
 
GD Birla and his contribution in management
GD Birla and his contribution in managementGD Birla and his contribution in management
GD Birla and his contribution in managementchhavia330
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Dave Litwiller
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communicationskarancommunications
 
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...lizamodels9
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...anilsa9823
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...lizamodels9
 
Monte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMMonte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMRavindra Nath Shukla
 
rishikeshgirls.in- Rishikesh call girl.pdf
rishikeshgirls.in- Rishikesh call girl.pdfrishikeshgirls.in- Rishikesh call girl.pdf
rishikeshgirls.in- Rishikesh call girl.pdfmuskan1121w
 
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Recently uploaded (20)

Non Text Magic Studio Magic Design for Presentations L&P.pptx
Non Text Magic Studio Magic Design for Presentations L&P.pptxNon Text Magic Studio Magic Design for Presentations L&P.pptx
Non Text Magic Studio Magic Design for Presentations L&P.pptx
 
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfIntro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
 
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
 
M.C Lodges -- Guest House in Jhang.
M.C Lodges --  Guest House in Jhang.M.C Lodges --  Guest House in Jhang.
M.C Lodges -- Guest House in Jhang.
 
The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024
 
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
 
BEST Call Girls In Greater Noida ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In Greater Noida ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,BEST Call Girls In Greater Noida ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In Greater Noida ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
 
Keppel Ltd. 1Q 2024 Business Update Presentation Slides
Keppel Ltd. 1Q 2024 Business Update  Presentation SlidesKeppel Ltd. 1Q 2024 Business Update  Presentation Slides
Keppel Ltd. 1Q 2024 Business Update Presentation Slides
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
 
Progress Report - Oracle Database Analyst Summit
Progress  Report - Oracle Database Analyst SummitProgress  Report - Oracle Database Analyst Summit
Progress Report - Oracle Database Analyst Summit
 
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
Tech Startup Growth Hacking 101  - Basics on Growth MarketingTech Startup Growth Hacking 101  - Basics on Growth Marketing
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
 
GD Birla and his contribution in management
GD Birla and his contribution in managementGD Birla and his contribution in management
GD Birla and his contribution in management
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communications
 
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
 
Monte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMMonte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSM
 
rishikeshgirls.in- Rishikesh call girl.pdf
rishikeshgirls.in- Rishikesh call girl.pdfrishikeshgirls.in- Rishikesh call girl.pdf
rishikeshgirls.in- Rishikesh call girl.pdf
 
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝
 

regression-linearandlogisitics-220524024037-4221a176 (1).pdf

  • 2. Agenda for Today’s Session WHAT IS REGRESSION We will understand the Regression and will move onto the types. REGRESSION DIAGNOSTICS Lets find what best works with Linear Regression 2 LINEAR REGRESSION vs LOGISITIC Lets find out the difference between Linear & Logistic Regression. UNDERSTANDING LINEAR REGRESSION ALGORITHM ASSUMPTIONS Lets Look into the Assumptions & Violations UNDERSTANDING LOGISTIC REGRESSION Lets Look into the Algorithm & understand how it functions. INTERVIEW QUESTIONS Covering some interview questions to strengthen the knowledge and give an idea about Interviews too.
  • 5. Understanding Linear Regression Algorithm 5 Establishes a relationship between the Independent & Dependent Variables. Examples of Independent & Dependent Variables:- • x is Rainfall and y is Crop Yield • x is Advertising Expense and y is Sales • x is sales of goods and y is GDP Here x is Independent Variable & Y is Dependent Variable Intro How it Works • Regression analysis is used to understand which among the Independent Variables are related to Dependent Variables. • It attempts to model relationship between two variables by fitting a line called Linear Regression Line. • The case of Single variable is called Simple Linear Regression where as the case of Multiple Independent Variables, it is called Multiple Linear Regression
  • 6. 6 Single Linear Regression Vs Multiple Linear Regression The Linear Regression line is created using Ordinary Least Square Method. X Y Simple Linear Regression Multiple Linear Regression X1 Y X2 X3 X4 Multiple Predictors
  • 7. Linear Regression Equation 7 y = mx + c Slope/Gradient Y Intercept
  • 8. Sum of Squared Error What is Error? • Actual Value – Predicted Value is called Error • Here Predicted Value is the value predicted by the Linear Regression Model. • Also known as Residual. Why it is Important? Smaller the residuals, more accurate model it would be.
  • 9. Regression Line | Best Fit Line What is the Line of Best Fit? • The Best Fit Line is the line that gives the minimum SSE. • Amongst all the possible lines, there will be one line that will be the best fit meaning greatest possible accuracy as a model. • The line that minimizes the sum of squared error of residuals is called Regression Line or the Best Fit Line. • In Simple Terms, it represents a straight line that best represents the data on scatterplot. It is drawn using the Least Square Method. Use the Least Square Method to Determine the Eqn. of Line of Best Fit x 8 2 11 6 5 4 12 9 6 1 y 3 10 3 6 8 12 1 4 9 14
  • 10. Finding Best Fit Line Algorithm How to find the Best Fit Line? • The equation of the straight line is given by y = mx +c m – slope of the line c - Intercept (The point at which the straight line touches y axis. • The Best Fit line is found basis the Least Squared Method. Algorithm: Step1: Find the Mean of x-values and y-values Step2: Calculate the slope of the line. It can be found using the following eqn. on the right. Step3: Compute the y-intercept of the line using the formula Mean of x and y values Finding m (Slope)
  • 11. Regression Line | Best Fit Line Use the Least Square Method to Determine the Eqn. of Line of Best Fit Steps: Following Steps are deployed to achieve the objective. Step1: Calculate the Mean of X and Ys Step2: Find the Following:-
  • 12. Regression Line | Best Fit Line Step1: Calculate the Mean of X and Ys The mean value of x is 6.4 and y is 7 Step2: Find the m (slope) m (slope) = -131/118.4 = -1.1 approx. Step3: Calculate the y intercept b (intercept) = 7 – (-1.1) * 6.4 = 14.0 approx. Thus, the equation of the line is y = -1.1x + 14
  • 13. Regression Equation and Line of Best Fit
  • 14. R Squared 1. R Squared is a statistical measure that represents the proportion of the variance for a DV explained by IV. 2. While correlation defines the strength, R Squared explains up to what extent the variance of one variable explains the variance of another var. 3. Example – In Investing, the R Squared is %age of a fund movement that can be explained by the movement of benchmark(sensex) 4. Aka Coefficient of Determination.
  • 16. How to see if the Assumptions are Violated – Deciding if Linear Model is a good Fit 16 Residual vs Fitted Values Plot 1. The x-axis has the fitted values and y axis has the Residuals. 2. Residual = Observed y value – Predicted y value. 3. Vertical distance of actual point vs line of the best fit is called Residual. 4. If unsure about the shape (curve) for regression equation by looking into the scatterplot, a residual plot helps in making decision. When a pattern is observed in a residual plot, a linear regression model is probably not appropriate for your data. Data should be randomly scattered around line 0
  • 17. Normal Q-Q Plot (Quantile Quantile Plot) 17 1. If the data is normally distributed, the points in the QQ-normal plot lie on a straight diagonal line. 2. Greater the departure from this reference line, the greater the evidence that the data is not following the normal distribution pattern.
  • 18. Interview Questions 18 01. Which of the following is true about Residuals? A) Lower is Better B) Higher is Better C) A or B depends on the situation. D) None of these
  • 19. Interview Questions - Solution 19 01. Which of the following is true about Residuals? A) Lower is Better B) Higher is Better C) A or B depends on the situation. D) None of these Solution A: Residuals refer to the error value of the model. Hence, Lower is Better. ( Residual = y – yhat)
  • 20. Interview Questions 20 02. Which of the statement is true regarding residuals in regression A) Mean of the Residuals is always zero. B) Mean of the Residuals is always less than zero. C) Mean of the Residuals is always more than zero. D) There is no such rule for residuals.
  • 21. Interview Questions - Solution 21 02. Which of the statement is true regarding residuals in regression A) Mean of the Residuals is always zero. B) Mean of the Residuals is always less than zero. C) Mean of the Residuals is always more than zero. D) There is no such rule for residuals. Solution: A Sum of residual in regression is always zero. It the sum of residuals is zero, the ‘Mean’ will also be zero.
  • 22. Interview Questions 22 03. To Test linear relationship of y (dependent) and x (independent) continuous variable, which of the following plots are best suited. A) Scatterplot B) Barplot C) Histograms D) None of These.
  • 23. Interview Questions - Solution 23 03. To Test linear relationship of y (dependent) and x (independent) continuous variable, which of the following plots are best suited. A) Scatterplot B) Barplot C) Histograms D) None of These. Solution: A To test the linear relationship between continuous variables Scatter plot is a good option. We can find out how one variable is changing w.r.t. another variable. A scatter plot displays the relationship between two quantitative variables.
  • 24. Interview Questions 24 04. A Correlation between the age and health of a person found to be -1.09. On the basis of this you would tell the doctors that: A) The age is good predictor of health B) The age is poor predictor of health C) None of These.
  • 25. Interview Questions - Solution 25 04. A Correlation between the age and health of a person found to be -1.09. On the basis of this you would tell the doctors that: A) The age is good predictor of health B) The age is poor predictor of health C) None of These. Solution: C Correlation coefficient range is between [-1 ,1]. So -1.09 is not possible.
  • 26. Interview Questions 26 05. Which of the following offsets, do we use in case of least square line fit? Suppose horizontal axis is independent variable and vertical axis is dependent variable. A) Vertical Offset B) Perpendicular Offset C) Both D) None of the Above.
  • 27. Interview Questions - Solution 27 05. Which of the following offsets, do we use in case of least square line fit? Suppose horizontal axis is independent variable and vertical axis is dependent variable. A) Vertical Offset B) Perpendicular Offset C) Both D) None of the Above. Solution: A We always consider residual as vertical offsets. Perpendicular offset are useful in case of PCA.
  • 28. 28 Linear Regression – Model Assumptions Since Linear Regression assesses whether one or more predictor variables explain the dependent Variable and hence it has 05 assumptions: 1. Linear Relationship 2. Normality 3. No or Little Multicollinearity 4. No Auto Correlation in errors 5. Homoscedasticity Note on sample size – The sample size thumb rule is that regression analysis requires at least 20 data points per independent variable in analysis.
  • 29. 29 1. Check for Linearity 1. Linear Regression needs the relationship between the independent & dependent variable to linear & additive. 2. Being additive means the effect of x on y is independent of other variables. 3. The linearity can be checked using the scatter plots. 4. Some examples are shown on right. It shows little to no correlation
  • 30. Transforming Variables to Achieve Linearity 30 • Each Row shows a different transformation method. • Transform column shows the method of transformation to be applied on DV or IV. • Regression equation is the equation used in analysis. • Last Column shows the equation of Prediction.
  • 31. Non Linear to Linear Conversion 31 The best transformation depends of the data & the best model will give the highest coefficient of Determination. Steps Involved:- 1. Create Linear Regression Model. 2. Construct a residual plot 3. If the plot is random, don’t transform the data. 4. Compute the Coefficient of Determination (R2) 5. Choose a Transformation method as mentioned in table in previous slide. 6. Transform IV or DV or both. 7. Apply Regression 8. If the Transformed R2 is greater than the previous score, the transformation is a success.
  • 32. Transformation Example 32 Objectives: 1. Create Linear Regression Model in Excel or R. 2. Find the Linear Regression Equation 3. Make Predictions 4. Create a Residual Plot and Find if Linear Regression is an Absolute Fit or not. 5. If Not, Try Transformation 0 10 20 30 40 50 60 70 80 0 2 4 6 8 10 Scatterplot Y Vs X
  • 33. 33 2. Check for Normality 1. Linear Regression requires all the variables need to be normal. 2. The normality can be checked using the histogram or Q-Q Plots. 3. Test for Normality aka goodness of fit test is called Kolmogorov Smirnov Test or Shapiro Wilk Normality Test 4. If the Data is not normal a non linear transformation ( e.g. Log Transformation) can fix the issue. 5. Normality means that Y values are normally distributed to each X.
  • 34. 34 3. Check for Multicollinearity 1. It means that the predictors are correlated with each other. Presence of correlation in independent variables lead of Multicollinearity. 2. What happens if variables are correlated - it becomes difficult for the model to determine the true effect of Independent Variables on Dependent. 3. Measure of Multicollinearity is given by VIF (Variable Influence Factor) 1. VIF tells us if the predictors are correlated, how much variance of an estimated coefficient increases. If no factors are correlated, VIF will be 1. 2. If VIF is 1 - No Multicollinearity. 3. VIF>1, the predictors may be correlated. 4. VIF between 5 & 10 – Indicates high correlation. 5. VIF >10: Regression coefficients are poorly estimated due to Multicollinearity. Solution: 1. To drop the variable showing high Collinearity. The presence of C suggests that the information provided by this variable for the DV is redundant and is of no use. 2. Another Approach is to combined the collinear variables and create new predictor (for e.g. taking average).
  • 36. 36
  • 37. 4. Heteroscedasticity 37 Image Source: Google Meaning that Data has different dispersion. In other terms, it is called with unequal scatter. Why it is a Problem It is a problem because OLS Regression assumes that all residuals are drawn from population that has constant variance ( Homoscedasticity). Ruins Results. Gives Biased Coefficients. Heteroscedasticity, also spelled heteroskedasticity, occurs more often in datasets that have a large range between the largest and smallest observed values. A classic example of heteroscedasticity is If you model household consumption based on income, you’ll find that the variability in consumption increases as income increases. Breusch-Pagan / Cook Weisberg Test - This test is used to determine presence of heteroskedasticity. If you find p < 0.05, you reject the null hypothesis and infer that heteroskedasticity is present.
  • 38. Scale Location Plot 38 • This plot is also useful to determine heteroskedasticity. Ideally, this plot shouldn't show any pattern. • Presence of a pattern determine heteroskedasticity. • Don't forget to corroborate the findings of this plot with the funnel shape in residual vs. fitted values. •
  • 39. Leverage Plot 39 Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. These leverage points can have an effect on the estimate of regression coefficients.
  • 40. Summary of Tests in Python for Linear Regression Assumptions 40 Multicollinearity Test from statsmodels.stats.outliers_influence import variance_inflation_factor Variance Inflation Factor Normality Test from scipy.stats import shapiro Shapiro Wilk Test Jarque Bera Test Autocorrelation Test Durbin Watson Test Breusch Pagan Test Heteroscedasticity Test import statsmodels.stats.api as sts Goldfeld Quandt Test Breusch Pagan Test Non Linearity Test import statsmodels.stats.api as sts Linear Rainbow Test
  • 41. Auto Correlation of Residuals 41 Auto Correlation of Errors means that the errors are Correlated. Assumption is that the Linear Regression Model Residuals are Not Correlated. Test of Assumption – Durbin Watson Test Package - statsmodels.stats.stattools.durbin_watson(resids, axis=0) What is the Null Hypothesis The Null Hypothesis of the test is that there is no serial correlation. Statistics (Always between 0 and 4) • The test statistic is equal to 2*(1-r) where r is the sample autocorrelation of the residuals. • Thus for r==0 indicating no serial correlation, the test statistic equals 2. • Closer to 0, more evidence for positive serial correlation and closer to 4 indicates negative serial correlation.
  • 42. Assessing Goodness of Fit - R2 42 After fitting the model, it becomes essential to understand how well the model fits the data. When the Model Fits Best on the Data? A Model fits the data well if the difference between the actual value and the model’s predicted value is small and unbiased. What is R-Squared (R2)? It is a statistical measure of how close the data is to the fitted regression line. The definition of R- squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or: R-squared = Explained variation / Total variation R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean. In general, the higher the R-squared, the better the model fits your data.
  • 43. Interview Questions 43 01. True-False: Linear Regression is a supervised machine learning algorithm. A) TRUE B) FALSE
  • 44. Interview Questions - Solution 44 True-False: Linear Regression is a supervised machine learning algorithm. A) TRUE B) FALSE Solution A: Yes, Linear Regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm must have an input variable (x) and an output variable (Y) for each example.
  • 45. Interview Questions 45 02. Which of the following methods do we use to find the best fit line for data in Linear Regression? A) Least Square Method B) Maximum Likelihood C) Both A and B
  • 46. Interview Questions - Solution 46 02. Which of the following methods do we use to find the best fit line for data in Linear Regression? A) Least Square Method B) Maximum Likelihood C) Both A and B Solution - A: In Linear Regression, we use the Least Square Method to identify the Best Fit Line.
  • 47. Interview Questions 47 03. Which of the following evaluation metrics can be used to evaluate a model while modelling a continuous output variable? A) AUC - ROC B) Accuracy C) Mean Squared Error
  • 48. Interview Questions - Solution 48 03. Which of the following evaluation metrics can be used to evaluate a model while modelling a continuous output variable? A) AUC - ROC B) Accuracy C) Mean Squared Error Solution - C: Linear Regression outputs continuous values, so we use the Mean Squared Error metric to evaluate model performance. The other options are used for classification problems.
  • 49. Interview Questions 49 05. Which of the following statements is true about outliers in Linear Regression? A) Linear Regression is sensitive to outliers B) Linear Regression is not sensitive to outliers C) No Idea D) None of these
  • 50. Interview Questions - Solution 50 05. Which of the following statements is true about outliers in Linear Regression? A) Linear Regression is sensitive to outliers B) Linear Regression is not sensitive to outliers C) No Idea D) None of these Solution A: The slope of the regression line will change if outliers are present in the data. Therefore, Linear Regression is sensitive to outliers.
  • 51. Linear Regression Example 51 Objectives: 1. Create a Linear Regression Model in Excel or R. 2. Find the Linear Regression Equation. 3. Make Predictions. 4. Create a Residual Plot and find whether Linear Regression is an appropriate fit or not.
  x: 1, 2, 3, 4, 5
  y: 2, 1, 3.5, 3, 4.5
  [Scatterplot of x and y with the fitted line y = 0.7x + 0.7]
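The slide suggests Excel or R; as a quick cross-check, the same numbers can be fitted in Python with numpy, which reproduces the slide's equation:

```python
# Reproducing the slide-51 example -- a least-squares fit of the same five points.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 3.5, 3, 4.5])

m, c = np.polyfit(x, y, deg=1)        # least-squares fit of a degree-1 polynomial
print(f"y = {m:.1f}x + {c:.1f}")      # -> y = 0.7x + 0.7, matching the slide

residuals = y - (m * x + c)           # plot these against x to judge the fit
print(residuals)
```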
  • 52. Assessing Model Fit 52 Residuals: The distance of an actual point from the value predicted by the regression line. Root Mean Squared Error (RMSE): Reported as "Residual Standard Error" in the linear model output. It is interpreted as how far, on average, the residuals are from zero. Mean Absolute Error (MAE): Another metric to evaluate the model. For example, if the actual y is 10 and the predicted y is 30, the absolute error is |10 - 30| = 20. MAE is more robust against the effect of outliers than RMSE. R Square: This metric gives the percentage of variance in the response explained by the model. It ranges between 0 and 1, and a higher value is generally better. Adjusted R Square: Since R Square increases whenever a new variable is introduced, regardless of whether that variable actually adds information to the model, we look to Adjusted R Square, which only increases if the newly added variable is truly useful.
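A minimal sketch computing these four metrics for the slide-51 example data, with predictions taken from the fitted line y = 0.7x + 0.7:

```python
# RMSE, MAE, R-squared and adjusted R-squared -- a minimal sketch.
import numpy as np

y_actual = np.array([2, 1, 3.5, 3, 4.5])
y_pred   = 0.7 * np.array([1, 2, 3, 4, 5]) + 0.7   # predictions from the fitted line

resid = y_actual - y_pred
rmse = np.sqrt(np.mean(resid ** 2))                # typical distance of residuals from zero
mae  = np.mean(np.abs(resid))                      # less sensitive to outliers than RMSE
r2   = 1 - np.sum(resid ** 2) / np.sum((y_actual - y_actual.mean()) ** 2)

n, p = len(y_actual), 1                            # observations, predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)      # penalises useless extra predictors
print(rmse, mae, r2, adj_r2)
```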
  • 53. Difference Between Linear and Logistic Regression 53
  Linear Regression | Logistic Regression
  The data is modelled using a straight line | A statistical model that predicts the probability of an outcome that can take two values
  The outcome (dependent variable) is continuous in nature | The outcome (dependent variable) has only a limited number of possible values
  Output variable is continuous | Output variable is discrete
  Used to solve regression problems | Used to solve classification problems (binary classification)
  Estimates the dependent variable when there is a change in the independent variable | Calculates the probability of occurrence of an event
  Uses the ordinary least squares method to minimise the errors and arrive at the best possible fit | Uses the maximum likelihood method to arrive at the solution
  Uses a straight line | Uses an S-curve (sigmoid function)
  Example: predicting sales, house prices, GDP, etc. | Example: predicting whether an email is spam, whether a credit card transaction is fraudulent, or whether a customer will buy a product
  • 55. Logistic Regression 55 • Logistic Regression is used when the dependent variable is categorical. • The predicted values are probabilities, strictly in the range 0 to 1. • It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
  • 56. Logistic Regression Equation 56 Fundamental equation of the generalized linear model: g(E(y)) = α + βx1 + γx2 • Here g() is the link function. The role of the link function is to 'link' the expectation of y to the linear predictor. • E(y) is the expectation of the target variable. • α + βx1 + γx2 is the linear predictor (α, β, γ are to be estimated). Let's take a simple regression equation with the dependent variable enclosed in a link function: g(y) = βo + β(Age). Here the g() function is trying to establish the probability of success (p) or the probability of failure (1-p). Criteria for p: • It must always be positive (p >= 0). • It must always be less than or equal to 1 (p <= 1). Since probability must always be positive, we put the linear predictor in exponential form: p = exp(βo + β(Age)). The exponential of any value is never negative, so this guarantees p >= 0 for any slope and predictor value.
  • 57. Logistic Regression Equation 57 p = exp(βo + β(Age)). In order to make the probability less than 1, we divide by a quantity greater than the numerator: p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1). Writing y = βo + β(Age), this gives p = e^y / (1 + e^y), which is the logistic (sigmoid) function. The probability of success is p = e^y / (1 + e^y) and the probability of failure is q = 1 - p = 1 / (1 + e^y). Dividing the two equations gives p/q = e^y, and taking the log of both sides gives log(p / (1-p)) = y.
  • 58. Logistic Regression Equation 58 Final equation: log(p / (1-p)) = βo + β(Age). Here log(p / (1-p)) is the link function (the logit) and p / (1-p) is the odds. If the log of the odds is positive, the probability of success is more than 50%.
  • 59. Sigmoid Function 59 • The Sigmoid Function, also called the Logistic Function, produces an S-shaped curve that maps any real value to a value between 0 and 1. • If the output of the sigmoid function is more than 0.5, we classify the outcome as 1, or Yes. • If the output of the sigmoid function is less than 0.5, we classify the outcome as 0, or No.
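A minimal sketch of the sigmoid function and the 0.5 decision rule; the linear-predictor values below are hypothetical:

```python
# The sigmoid (logistic) function and a 0.5 decision threshold -- a minimal sketch.
import numpy as np

def sigmoid(z):
    """Map any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])   # hypothetical linear-predictor values
p = sigmoid(z)
labels = (p > 0.5).astype(int)               # p > 0.5 -> class 1 ("Yes"), else 0 ("No")
print(p, labels)
```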
  • 60. ROC Curve 60 • The ROC curve plots the true positive rate against the false positive rate at various classification thresholds; the AUC represents the degree or measure of separability. • The higher the AUC, the better the model. • Here TPR, the True Positive Rate (also called Recall or Sensitivity), is given by TP / (TP + FN). • Specificity = TN / (TN + FP), and FPR (False Positive Rate) = 1 - Specificity.
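A minimal sketch using scikit-learn (assuming it is available); the labels and scores are toy data:

```python
# ROC curve and AUC -- a minimal sketch with hypothetical labels and scores.
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                    # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points along the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))         # closer to 1 -> better separability
```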
  • 61. What is a Confusion Matrix A confusion matrix is a table of actual values (Y) against predicted values (Y hat). Let's say we are predicting the presence of a disease: "yes" means they have the disease and "no" means they don't. 1. The classifier made a total of 165 predictions. 2. Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times. 3. In reality, 105 patients in the sample have the disease, and 60 patients do not. Let's understand the basic terms: • True Positives (TP): predicted "yes" and actually "yes". • True Negatives (TN): predicted "no" and actually "no". • False Positives (FP): predicted "yes" but actually "no". • False Negatives (FN): predicted "no" but actually "yes".
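The slide's totals alone do not pin down the four cells; one breakdown consistent with them is TP = 100, TN = 50, FP = 10, FN = 5, which is assumed in this sketch:

```python
# Metrics from a confusion matrix -- the cell values are assumed; the slide only
# gives the marginals (165 total, 110 predicted yes, 105 actual yes), and
# TP=100, TN=50, FP=10, FN=5 is one breakdown consistent with them.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                    # 165

accuracy    = (TP + TN) / total              # overall, how often is it right?
sensitivity = TP / (TP + FN)                 # recall / true positive rate
specificity = TN / (TN + FP)                 # true negative rate
print(accuracy, sensitivity, specificity)    # ~0.909, ~0.952, ~0.833
```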
  • 62. Performance of Logistic Regression Model 62 1. AIC (Akaike Information Criterion): the analogous metric to adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of coefficients; therefore, we always prefer the model with the minimum AIC value. 2. Confusion Matrix: the tabular representation of actual vs. predicted values. It helps us find the accuracy of the model, calculated as Accuracy = (TP + TN) / Total Predictions. 3. ROC Curve: • The ROC Curve (Receiver Operating Characteristic curve) evaluates the trade-off between the true positive rate and the false positive rate. • A common choice is a threshold of p > 0.5, since we are usually more concerned with correctly identifying successes; the threshold can be tuned to the problem. • The higher the area under the curve, the better the predictive power of the model.
  • 63. Logistic Regression Assumptions 63 • Logistic Regression does not need a linear relationship between the dependent and independent variables. • The errors (residuals) need not be normally distributed. • There should be little to no multicollinearity amongst the independent variables. • The outcome is a binary variable: yes or no, 1 or 0, positive or negative, etc. • For a binary regression, factor level 1 of the dependent variable should represent the desired outcome. • There is a linear relationship between the logit of the outcome and each predictor variable. Recall that the logit function is logit(p) = log(p/(1-p)), where p is the probability of the outcome. • Logistic Regression requires quite large sample sizes.
  • 65. Forward Selection 65 1. It is a process that begins with an empty model and keeps adding variables one by one. 2. It begins with just the intercept. Tests are performed to find the most relevant, 'best' variables. 3. The best variable is the one that returns the highest coefficient of determination (R-Squared value). 4. The process continues until adding more variables no longer improves the model, at which point it stops. 5. Several criteria are used to determine which variable goes in: lowest RMSE on cross-validation, F-test score, or lowest p-value. Tip: While using Forward Selection, in order to test the accuracy of the model, it is better to run the trained classifier against test data to make predictions. A minimal sketch follows below.
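A minimal sketch of forward selection using statsmodels, with adjusted R-squared as the improvement criterion; `forward_selection`, `X` (a DataFrame of predictors) and `y` are hypothetical names, and this is one common formulation rather than a canonical implementation:

```python
# Forward selection by adjusted R-squared -- a minimal, hedged sketch.
import statsmodels.api as sm

def forward_selection(X, y):
    """Greedily add the predictor that most improves adjusted R-squared."""
    remaining, selected = list(X.columns), []
    best_score = float("-inf")
    while remaining:
        # Score each candidate model that adds one more variable
        scores = {col: sm.OLS(y, sm.add_constant(X[selected + [col]])).fit().rsquared_adj
                  for col in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= best_score:   # no further improvement -> stop
            break
        best_score = scores[best]
        selected.append(best)
        remaining.remove(best)
    return selected
```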
  • 66. Backward Selection 66 1. It is a process that begins with all variables and keeps removing predictors one by one. 2. The variable with the largest p-value, meaning the variable that is least significant, is removed. 3. The new (p-1)-variable model, with the largest p-value removed, is refitted. 4. The process continues until every remaining variable has a significant p-value, at which point we stop. 5. Several criteria are used to determine which variable goes out: lowest RMSE on cross-validation, F-test score, or largest p-value. A minimal sketch follows below.
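A minimal sketch of backward elimination by p-value with statsmodels, under the same hypothetical `X`/`y` setup as the forward-selection sketch:

```python
# Backward elimination by p-value -- a minimal, hedged sketch.
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    """Repeatedly drop the least significant predictor until all p-values < alpha."""
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = model.pvalues.drop("const")    # ignore the intercept's p-value
        worst = pvals.idxmax()
        if pvals[worst] < alpha:               # everything significant -> done
            break
        cols.remove(worst)                     # drop the largest p-value
    return cols
```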
  • 67. Let's review some concepts: Linear Regression, Assumptions of Linear Regression, Difference Between Linear and Logistic Regression, Logistic Regression, Diagnostic Plots, Forward and Backward Elimination 67
  • 68. Thanks! Any questions? You can find me at Linkedin @mkschauhan mukul.mschauhan@gmail.com 68 https://www.linkedin.com/pulse/data-visualisation-using-seaborn-mukul-kr-singh-chauhan https://www.linkedin.com/pulse/introduction-ggplot-series-mukul-kr-singh-chauhan/ https://www.linkedin.com/pulse/what-data-science-mukul-chauhan/