This document provides an overview of linear regression techniques. It begins by introducing deterministic versus statistical relationships and simple linear regression, then covers model evaluation, gradient descent, and polynomial regression. It discusses the bias-variance tradeoff, regularization techniques such as lasso and ridge regression, and stochastic gradient descent. It concludes with robust regressors that tolerate outliers in the data.
2. “Goal - Become a Data Scientist”
“A Dream becomes a Goal when action is taken towards its achievement” - Bo Bennett
“The Plan”
“A Goal without a Plan is just a wish”
3. Agenda
● Deterministic vs Statistical Relations
● Introduction to Linear Regression
● Simple Linear Regression
● Model Evaluation
● Gradient Descent
● Polynomial Regression
● Bias and Variance
● Regularization
● Lasso Regression
● Ridge Regression
● Stochastic Gradient Descent
● Robust Regressors for data with outliers
4. Deterministic vs Statistical Relations
● Deterministic Relations
○ The data points fall exactly on the relationship
○ The relation can be expressed as an exact formula
○ Example: Converting Celsius to Fahrenheit (see the sketch below)
● Statistical Relations
○ Exhibit a trend, but not a perfect relation
○ The data also exhibits some scatter around the trend
○ Example: Height vs Weight
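A minimal sketch contrasting the two kinds of relation; the Celsius-to-Fahrenheit formula is exact, while the height-weight data is synthetic, invented here purely for illustration:

```python
import numpy as np

# Deterministic relation: an exact formula, no scatter.
def celsius_to_fahrenheit(c):
    return 9.0 / 5.0 * c + 32.0

print(celsius_to_fahrenheit(100.0))  # always exactly 212.0

# Statistical relation: a trend plus scatter (synthetic illustration).
rng = np.random.default_rng(0)
height_cm = rng.uniform(150, 190, size=100)
weight_kg = 0.9 * height_cm - 90 + rng.normal(0, 5, size=100)  # noisy trend
```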
5. Introduction to Linear Regression
● The simplest and most widely used regression technique
● A baseline prediction is simply the average of the data
● Additional information (a feature) gives better predictions
● How much better is measured by the residuals
● The task is finding the line of best fit
8. Simple Linear Regression
● One target variable and only one feature
● Follows the general form of a linear equation: h(x) = Θ0 + Θ1·x
'Θ0' is the intercept
'Θ1' is the slope of the line
● The fitted line is an estimate of the population relationship
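A minimal sketch of fitting a simple linear regression with scikit-learn; the data values here are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one feature, one target (values invented for illustration).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

model = LinearRegression().fit(X, y)
print(model.intercept_)        # Θ0, the intercept
print(model.coef_[0])          # Θ1, the slope
print(model.predict([[6.0]]))  # prediction for a new x
```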
9. Assumptions of Linear Regression
● The population line: y_i = β0 + β1·x_i + ε_i; E(Y_i) = β0 + β1·x_i
● The mean response E(Y_i), at each value of x_i, is a linear function of x_i
● The errors are
○ Independent
○ Normally distributed
○ Equal variances (denoted σ^2)
10. Line of Best-Fit
● The best-fit line is the one with the lowest SSE
● SSE - the sum of squared residual errors: SSE = sum(Y - h(X))^2
h(X) is the predicted value
● Squaring penalizes larger errors more heavily
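A minimal sketch of computing the SSE for a candidate line on the same toy data; the candidate parameters are arbitrary choices for illustration:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

theta0, theta1 = 0.1, 2.0     # candidate intercept and slope
h = theta0 + theta1 * X       # h(X), the predicted values
residuals = y - h             # residual errors
sse = np.sum(residuals ** 2)  # squaring penalizes larger errors more
print(sse)
```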
11. Coefficient of Determination
SSR - "Regression sum of squares" = sum(Yh - Ymn)^2
SSE - "Error sum of squares" = sum(Y - Yh)^2
SSTO - "Total sum of squares" = SSR + SSE = sum(Y - Ymn)^2
(Yh is the predicted value of Y; Ymn is the mean of Y)
R-squared = SSR/SSTO = 1 - (SSE/SSTO)
"R-squared×100 percent of the variation in y is 'explained by' the variation in predictor x"
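These quantities in code, as a minimal sketch; the predicted values are illustrative stand-ins, not the output of a real fit:

```python
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])       # observed values Y
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # predicted values Yh (illustrative)
y_mean = y.mean()                              # Ymn

ssr = np.sum((y_hat - y_mean) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)       # error sum of squares
ssto = np.sum((y - y_mean) ** 2)     # total sum of squares (= SSR + SSE for a least-squares fit)
r_squared = 1 - sse / ssto
print(r_squared)
```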
12. The Cost Function
● The cost function is what we optimize to fit the parameters
● The L2 norm (squared error) is preferred as the cost function
● We use MSE (Mean Squared Error) as the cost function
● MSE is the average of the SSE over the n data points: MSE = SSE / n
● Minimizing the SSE is the Least Squares Criterion
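As a tiny sketch, the MSE is just the SSE divided by the number of data points:

```python
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # observed values
h = np.array([2.1, 4.1, 6.1, 8.1, 10.1])  # predicted values (illustrative)
mse = np.sum((y - h) ** 2) / len(y)       # MSE = SSE / n
print(mse)
```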
13. Normal Equation
● Derived by directly setting the gradient of the cost to zero
● A simple equation, but..
○ It is a closed-form solution, and inverting XᵀX is costly
○ It performs well only when the number of features is small
○ The number of data points should always be greater than the number of variables
○ Better techniques are available when regularizing the model
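A minimal sketch of the normal equation, θ = (XᵀX)⁻¹Xᵀy, in NumPy on the toy data from earlier:

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Prepend a column of ones so theta[0] plays the role of the intercept Θ0.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve (XᵀX)θ = Xᵀy; solving is preferred over forming an explicit inverse.
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(theta)  # [Θ0, Θ1]
```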
14. Gradient Descent Algorithm
● Optimization is a big part of machine learning
● Gradient descent is a simple optimization procedure
● It iteratively finds the parameter values at the global minimum (the MSE cost is convex, so the minimum it reaches is global)
● "Alpha" (α) is the learning rate, which controls the step size
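A minimal sketch of batch gradient descent on the MSE cost for the toy data; the learning rate and iteration count are arbitrary choices for illustration:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
n = len(y)

theta0, theta1 = 0.0, 0.0  # start from zero
alpha = 0.01               # learning rate "alpha"

for _ in range(10000):
    h = theta0 + theta1 * X                  # current predictions
    grad0 = (2.0 / n) * np.sum(h - y)        # dMSE/dΘ0
    grad1 = (2.0 / n) * np.sum((h - y) * X)  # dMSE/dΘ1
    theta0 -= alpha * grad0                  # step against the gradient
    theta1 -= alpha * grad1

print(theta0, theta1)  # approaches the least-squares solution
```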
18. Polynomial Regression
● Derives new features (powers of the original feature)
● Better at estimating values when the trend is nonlinear
● Predicts a curve rather than a simple line
● The model is still linear in the derived feature space (e.g., the plane of x and x^2), so it is really just multiple regression
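A minimal sketch of polynomial regression with scikit-learn: derive the polynomial features, then fit an ordinary linear model on them. The degree and the synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic trend plus noise (invented for illustration).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(0, 0.3, 50)

# Degree-2 polynomial regression: derive [x, x^2], then fit linearly.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))  # predicts along the fitted curve
```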
21. Regularization
● Used to overcome the overfitting problem
● An overfitted model has high-variance estimates
● High-variance estimates are not good estimates
● A trade between bias and variance is achieved
● Works by limiting (shrinking) the parameters
● Different techniques exist to limit the parameters
22. L2 - Regularization
● Objective = RSS + α * (sum of squares of the coefficients)
○ α = 0: The objective becomes the same as simple linear regression
○ α = ∞: The coefficients will be zero
○ 0 < α < ∞: The coefficients will lie somewhere between zero and the ones from simple linear regression
● As the value of alpha increases, the model complexity reduces
● Though the coefficients become very, very small, they are NOT zero
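A minimal sketch of ridge (L2) regression with scikit-learn, where alpha plays the role of α above; the alpha values and data are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Larger alpha shrinks the coefficient more, but never exactly to zero.
for alpha in [0.1, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)
```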
23. L1 - Regularization
● Objective = RSS + α * (sum of absolute values of the coefficients)
● For the same values of alpha, the coefficients of lasso regression are much smaller compared to those of ridge regression
● For the same alpha, lasso has a higher RSS (a poorer fit) compared to ridge regression
● Many of the coefficients become exactly zero, even for very small values of alpha
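A minimal sketch of lasso (L1) regression on multi-feature synthetic data, showing coefficients driven exactly to zero; the data and alpha are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but only the first two actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # most coefficients are exactly 0.0 - feature selection
```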
24. L2 vs L1
Key Differences
● L2 Reg.: Includes all (or none) of the features in the model
● L1 Reg.: Performs feature selection
Typical Use Cases
● L2 Reg.: Majorly used to prevent overfitting
● L1 Reg.: Sparse solutions - modelling cases where the features number in the millions or more
Presence of Highly Correlated Features
● L2 Reg.: Works well even in the presence of highly correlated features
● L1 Reg.: Arbitrarily selects any one feature among the highly correlated ones
25. Stochastic Gradient Descent
● A simple yet efficient approach for fitting linear models
● Supports out-of-core training (data too large to fit in memory)
● Randomly selects data and updates the model on it
● Repeating this step keeps tuning the model
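A minimal sketch with scikit-learn's SGDRegressor; partial_fit is what enables out-of-core training, since each call updates the model on just one batch. The batch size, stream length, and data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01)

# Simulate streaming batches that would not all fit in memory at once.
for _ in range(200):
    X_batch = rng.normal(size=(32, 3))
    y_batch = X_batch @ np.array([1.0, -2.0, 3.0]) + rng.normal(0, 0.1, 32)
    model.partial_fit(X_batch, y_batch)  # one update per randomly drawn batch

print(model.coef_)  # approaches [1, -2, 3]
```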
26. Robust Regression
● Outliers can have a serious impact on the estimation of the predictor
● Huber Regression vs Ridge Regression: the Huber loss grows only linearly for large residuals, so outliers pull the fit far less than under a squared loss
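A minimal sketch comparing HuberRegressor and Ridge from scikit-learn on synthetic data with injected outliers; all values here are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, Ridge

# Clean linear data plus a few gross outliers.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 50)
y[:5] += 30.0  # inject outliers

huber = HuberRegressor().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("Huber:", huber.coef_[0], huber.intercept_)  # barely moved by the outliers
print("Ridge:", ridge.coef_[0], ridge.intercept_)  # squared loss lets outliers pull the fit
```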