zekeLabs
Linear Regression
“Goal - Become a Data Scientist”
“A Dream becomes a Goal when action is taken towards its achievement” - Bo Bennett
“The Plan”
“A Goal without a Plan is just a wish”
Agenda
● Deterministic vs Statistical Relations
● Introduction to Linear Regression
● Simple Linear Regression
● Model Evaluation
● Gradient Descent
● Polynomial Regression
● Bias and Variance
● Regularization
● Lasso Regression
● Ridge Regression
● Stochastic Gradient Descent
● Robust Regressors for data with outliers
Deterministic vs Statistical Relations
● Deterministic Relations
○ The data points fall exactly on the relationship
○ The relation can be written down as an exact formula
○ Example: Converting Celsius to Fahrenheit
● Statistical Relations
○ They exhibit a trend, but not a perfect relation
○ The data also exhibits some scatter
○ Example: Height vs Weight
Introduction to Linear Regression
● The simplest and most widely used regression model
● A baseline prediction: the average of the data
● Better predictions come from using additional information (features)
● The improvement is measured by residuals
● Finding the line of best fit
An Example
● Predict every tip with the average tip, $10, so residual = tip - 10

Meal | Tip Amount ($) | Residual | Residual Sq.
1    | 5              | -5       | 25
2    | 17             | 7        | 49
3    | 11             | 1        | 1
4    | 8              | -2       | 4
5    | 14             | 4        | 16
6    | 5              | -5       | 25
     |                | SSE      | 120
Better Prediction

Bill ($) | Tip Amount ($) | Residual | Residual Sq.
34       | 5              | 0.8495   | 0.7217
108      | 17             | 2.0307   | 4.1237
64       | 11             | 2.4635   | 6.0688
88       | 8              | -4.0453  | 16.3645
99       | 14             | 0.3465   | 0.1201
51       | 5              | -1.6359  | 2.6762
         |                | SSE      | 30.075
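As a rough sketch (assuming NumPy, which the slides do not name), the table above can be reproduced by fitting the least-squares line of tip vs. bill and comparing its SSE with the mean-only prediction:

```python
# Sketch: compare the mean-only prediction with the least-squares line
import numpy as np

bill = np.array([34, 108, 64, 88, 99, 51], dtype=float)
tip = np.array([5, 17, 11, 8, 14, 5], dtype=float)

# Mean-only model: predict the average tip ($10) for every meal
sse_mean = np.sum((tip - tip.mean()) ** 2)                 # 120.0

# Least-squares line: tip ≈ Θ0 + Θ1 * bill
theta1, theta0 = np.polyfit(bill, tip, deg=1)              # slope, intercept
sse_line = np.sum((tip - (theta0 + theta1 * bill)) ** 2)   # ≈ 30.075

print(f"SSE (mean only) = {sse_mean:.3f}")
print(f"SSE (best fit)  = {sse_line:.3f}")
```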
Simple Linear Regression
● One target variable and only one feature
● Follows the general form of a linear equation: h(x) = Θ0 + Θ1·x
‘Θ0’ is the intercept
‘Θ1’ is the slope of the line
● The fitted line is an estimate of the relationship in the population
Assumptions of Linear Regression
● The population line: yi=β0+β1xi+ϵi; E(Yi)=β0+β1xi
● E(Yi), at each value of xi, is a linear function of xi
● The errors ϵi are
○ Independent
○ Normally distributed
○ Of equal variance (denoted σ²)
Line of Best-Fit
● The best-fit line is the one with the lowest SSE
● SSE - the sum of squared residual errors: SSE = Σ(y - h(X))², where h(X) is the predicted value
● Squaring penalizes larger errors more heavily
Coefficient of Determination
SSR - "Regression sum of squares" = sum(Yh - Ymn)^2
SSE - "Error sum of squares" = sum(Y - Yh)^2
SSTO - "Total sum of squares" = SSR + SSE = sum(Y - Ymn)^2
R-squared = SSR/SSTO = 1 - (SSE/SSTO)
"R-squared×100 percent of the variation in y is 'explained by' the variation in
predictor x"
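As a quick worked check using the tip example above: SSTO = 120 (the SSE of the mean-only prediction) and SSE = 30.075 for the best-fit line, so R² = 1 - 30.075/120 ≈ 0.75; about 75% of the variation in tip amount is explained by the bill.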
The Cost Function
● The cost function is what we minimize to find the parameters
● The L2 norm (squared error) is the preferred cost function
● We use MSE (Mean Squared Error) as the cost function: MSE = (1/2m)·Σ(y - h(x))²
● MSE is the SSE averaged over the m data points
● Minimizing the SSE is the Least Squares Criterion
Normal Equation
● Derived by directly setting the gradient of the cost function to zero
● A simple equation, Θ = (XᵀX)⁻¹Xᵀy, but..
○ It is a closed-form solution
○ It performs well only when the number of features is small
○ The number of data points must always be greater than the number of variables
○ Better techniques are available when regularizing the model (see the sketch below)
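A minimal sketch of the normal equation (assuming NumPy), reusing the bill/tip numbers from the earlier example:

```python
# Sketch: closed-form normal equation, Θ = (XᵀX)⁻¹ Xᵀ y
import numpy as np

bill = np.array([34, 108, 64, 88, 99, 51], dtype=float)
tip = np.array([5, 17, 11, 8, 14, 5], dtype=float)

# Design matrix with a column of ones for the intercept Θ0
X = np.column_stack([np.ones_like(bill), bill])

# Solve (XᵀX)Θ = Xᵀy directly (np.linalg.solve avoids an explicit inverse)
theta = np.linalg.solve(X.T @ X, X.T @ tip)
print(theta)   # [Θ0, Θ1]: intercept and slope of the best-fit line
```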
Gradient Descent Algorithm
● Optimization is a big part of machine learning
● Gradient descent is a simple optimization procedure
● It finds the parameter values at the global minimum (the MSE cost is convex)
● Repeated update: Θ := Θ - α·∇J(Θ), where “alpha” (α) is the learning rate
Math behind GD
GDs Calculated (housing data, min-max standardized to Xs, Ys; current parameters a = 0.45, b = 0.75)

House size (X) | House price (Y) | Xs   | Ys   | Yh   | SSE    | -(Ys-Yh) | -(Ys-Yh)·Xs
1,100          | 199,000         | 0    | 0    | 0.45 | 0.2025 | 0.45     | 0
1,400          | 245,000         | 0.22 | 0.22 | 0.62 | 0.16   | 0.4      | 0.088
1,425          | 319,000         | 0.24 | 0.58 | 0.63 | 0.0025 | 0.05     | 0.012
1,550          | 240,000         | 0.33 | 0.2  | 0.7  | 0.25   | 0.5      | 0.165
1,600          | 312,000         | 0.37 | 0.55 | 0.73 | 0.0324 | 0.18     | 0.0666
1,700          | 279,000         | 0.44 | 0.39 | 0.78 | 0.1521 | 0.39     | 0.1716
1,700          | 310,000         | 0.44 | 0.54 | 0.78 | 0.0576 | 0.24     | 0.1056
1,875          | 308,000         | 0.57 | 0.53 | 0.88 | 0.1225 | 0.35     | 0.1995
2,350          | 405,000         | 0.93 | 1    | 1.14 | 0.0196 | 0.14     | 0.1302
2,450          | 324,000         | 1    | 0.61 | 1.2  | 0.3481 | 0.59     | 0.59

Column sums: SSE = 1.3473, -(Ys-Yh) = 3.300, -(Ys-Yh)·Xs = 1.545
MSE = 1.3473/(2·10) = 0.0673; dMSE/da = 3.300/10 = 0.330; dMSE/db = 1.545/10 = 0.154
Deep Dive
X = 1,400; Y = 245,000; a = 0.45; b = 0.75; m = total no. of data points = 10
Xs = (X - Xmin)/(Xmax - Xmin) = (1,400 - 1,100)/(2,450 - 1,100) = 0.22
Ys = (Y - Ymin)/(Ymax - Ymin) = (245 - 199)/(405 - 199) = 0.22   (prices in thousands)
Yh = a + b·Xs = 0.45 + 0.75·(0.22) = 0.62
SSEi = (Ys - Yh)² = (0.22 - 0.62)² = 0.16
Gradients: dMSE/da = -(Ys - Yh) = 0.4
dMSE/db = -(Ys - Yh)·Xs = 0.088
MSE = (1/2m)·Σ SSEi = 0.0673
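The numbers above can be reproduced with a short script; this is a sketch assuming NumPy, and the learning-rate value used in the final update step is an illustrative assumption, not from the slides:

```python
# Sketch: reproduce the table and Deep Dive numbers, then take one GD step
import numpy as np

size = np.array([1100, 1400, 1425, 1550, 1600, 1700, 1700, 1875, 2350, 2450], dtype=float)
price = np.array([199000, 245000, 319000, 240000, 312000,
                  279000, 310000, 308000, 405000, 324000], dtype=float)

# Min-max scaling to [0, 1]
Xs = (size - size.min()) / (size.max() - size.min())
Ys = (price - price.min()) / (price.max() - price.min())

a, b, m = 0.45, 0.75, len(Xs)
Yh = a + b * Xs                                   # current predictions

mse = np.sum((Ys - Yh) ** 2) / (2 * m)            # ≈ 0.0673
grad_a = np.sum(-(Ys - Yh)) / m                   # dMSE/da ≈ 0.330
grad_b = np.sum(-(Ys - Yh) * Xs) / m              # dMSE/db ≈ 0.154

alpha = 0.1                                       # learning rate (assumed value)
a, b = a - alpha * grad_a, b - alpha * grad_b     # one gradient descent update
print(mse, grad_a, grad_b, a, b)
```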
Polynomial Regression
● Derives new features (powers of the original feature)
● Better at estimating values when the trend is nonlinear
● Predicts a curve rather than a simple straight line
● The fit is still linear in the derived feature space, e.g. a quadratic curve is a plane in the 2-D space (x, x²), so it is just a multiple regression (see the sketch below)
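A minimal polynomial-regression sketch, assuming scikit-learn; the degree and the synthetic data-generating curve below are illustrative assumptions:

```python
# Sketch: derive polynomial features, then fit an ordinary linear model on them
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(50, 1)), axis=0)
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + 2 + rng.normal(scale=0.5, size=50)  # nonlinear trend

# Still a linear model, but in the derived features [1, x, x^2]
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))   # prediction follows a curve, not a straight line
```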
Bias-Variance Tradeoff
The Bulls-Eye Diagram
Regularization
● Used to overcome the overfitting problem
● An overfitted model has high-variance estimates
● High-variance estimates are not good estimates
● Regularization trades a little bias for a large reduction in variance
● It works by limiting (shrinking) the parameters
● Different techniques exist to limit the parameters (L1 and L2 below)
L2 - Regularization
● Objective = RSS + α * (sum of squares of the coefficients)
○ α = 0: the objective becomes the same as simple linear regression
○ α = ∞: the coefficients will be zero
○ 0 < α < ∞: the coefficients lie somewhere between 0 and the simple-linear-regression values
● As the value of alpha increases, the model complexity reduces
● Though the coefficients become very, very small, they are NOT zero (see the Ridge sketch below)
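A minimal Ridge (L2) sketch, assuming scikit-learn; the synthetic data and alpha values are illustrative assumptions. It shows coefficients shrinking as alpha grows, without becoming exactly zero:

```python
# Sketch: Ridge (L2) coefficients shrink toward zero as alpha grows, but stay nonzero
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

print("alpha=0  :", LinearRegression().fit(X, y).coef_)   # plain least squares
for alpha in (1, 10, 100):
    print(f"alpha={alpha:<3}:", Ridge(alpha=alpha).fit(X, y).coef_)
```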
L1 - Regularization
● Objective = RSS + α * (sum of absolute values of the coefficients)
● For the same value of alpha, the lasso coefficients are much smaller than the ridge coefficients
● For the same alpha, lasso has a higher RSS (a poorer fit) than ridge regression
● Many coefficients are exactly zero even for very small values of alpha (see the Lasso sketch below)
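A matching Lasso (L1) sketch under the same assumptions (scikit-learn, synthetic data, illustrative alpha values); coefficients at or below the noise level are driven exactly to zero, and more follow as alpha grows:

```python
# Sketch: Lasso (L1) drives some coefficients exactly to zero (feature selection)
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

for alpha in (0.01, 0.1, 1.0):
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha}: {np.round(coef, 3)}  zero coefficients: {int(np.sum(coef == 0))}")
```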
L2 vs L1
● Key Differences
○ L2 Reg.: includes all (or none) of the features in the model
○ L1 Reg.: performs feature selection
● Typical Use Cases
○ L2 Reg.: majorly used to prevent overfitting
○ L1 Reg.: sparse solutions - modelling cases where the features are in the millions or more
● Presence of Highly Correlated Features
○ L2 Reg.: works well even in the presence of highly correlated features
○ L1 Reg.: arbitrarily selects any one feature among the highly correlated ones
Stochastic Gradient Descent
● A simple yet efficient approach for fitting linear models
● Supports out-of-core training
● Randomly select data and train the model on it
● Repeat the above step; the model keeps tuning (see the sketch below)
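A minimal out-of-core sketch, assuming scikit-learn's SGDRegressor; the mini-batch stream, batch size, and true coefficients below are illustrative assumptions:

```python
# Sketch: out-of-core training with partial_fit on a stream of random mini-batches
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
sgd = SGDRegressor(random_state=0)

for _ in range(200):                       # each pass sees a fresh random mini-batch
    X_batch = rng.normal(size=(32, 3))
    y_batch = X_batch @ np.array([1.5, -0.7, 2.0]) + rng.normal(scale=0.1, size=32)
    sgd.partial_fit(X_batch, y_batch)      # the model keeps tuning on every batch

print(sgd.coef_, sgd.intercept_)           # roughly recovers [1.5, -0.7, 2.0]
```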
Robust Regression
● Outliers have a serious impact on the estimation of the predictor
● Huber Regression vs Ridge Regression (see the sketch below)
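A minimal sketch contrasting the two, assuming scikit-learn; the synthetic data and the injected outliers are illustrative assumptions:

```python
# Sketch: Huber loss down-weights outliers; squared-error (Ridge) is pulled by them
import numpy as np
from sklearn.linear_model import HuberRegressor, Ridge

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=50)   # true line: y = 2x + 1
y[:5] += 40                                                # inject a few large outliers

huber = HuberRegressor().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("Huber:", huber.coef_[0], huber.intercept_)   # typically stays near slope 2, intercept 1
print("Ridge:", ridge.coef_[0], ridge.intercept_)   # noticeably shifted by the outliers
```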
