Regression analysis is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is commonly used to predict or estimate the value of a dependent variable based on the values of independent variables.
Here are the key components and concepts of regression analysis:
Dependent Variable: The dependent variable, also known as the response variable or outcome variable, is the variable that you want to predict or explain. It is typically denoted as Y.
Independent Variables: Independent variables, also known as predictor variables or explanatory variables, are the variables that are used to predict or explain the dependent variable. They are denoted as X1, X2, X3, and so on. The relationship between the dependent variable and independent variables is expressed through a mathematical equation.
Regression Equation: The regression equation is a mathematical representation of the relationship between the dependent variable and independent variables. It takes the form of Y = β0 + β1X1 + β2X2 + ... + ε, where β0 is the intercept (the value of the dependent variable when all independent variables are zero), β1, β2, etc. are the coefficients representing the impact of each independent variable, and ε is the error term accounting for unexplained variation in the dependent variable.
Regression Coefficients: The regression coefficients (β1, β2, etc.) represent the change in the dependent variable for a one-unit change in the corresponding independent variable while holding other variables constant. They indicate the direction and strength of the relationship between the variables.
Ordinary Least Squares (OLS): OLS is a commonly used method to estimate the regression coefficients. It minimizes the sum of the squared differences between the observed values of the dependent variable and the predicted values based on the regression equation.
Residuals: Residuals are the differences between the observed values of the dependent variable and the predicted values from the regression equation. Residual analysis helps assess the goodness of fit of the regression model and identifies any systematic patterns or outliers.
Regression analysis can be applied in various fields, including economics, finance, social sciences, and marketing, among others. It allows for understanding the relationships between variables, making predictions, and testing hypotheses.
There are different types of regression analysis, including simple linear regression (with one independent variable), multiple linear regression (with multiple independent variables), logistic regression (for binary outcomes), and nonlinear regression (when the relationship between variables is not linear).
Overall, regression analysis provides a powerful tool for modeling and analyzing the relationships between variables and making predictions based on data.
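As a minimal illustration of the regression equation and OLS estimation described above, the following Python sketch fits Y = β0 + β1X1 + ε with NumPy. All data values and the random seed are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)                # one independent variable X1
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, 50)   # true beta0=2.0, beta1=0.5, plus noise

# Build the design matrix with a column of ones so that the intercept
# beta0 is estimated together with beta1.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"estimated beta0={beta[0]:.2f}, beta1={beta[1]:.2f}")
```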
1. Chapter 08
Regression
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
2. Outline
• Overview
• Linear Regression Models
• Comparison of Regression Models
• Summary
3. Example of Regression
[Figure: regression example relating the top speed of cars to their fuel consumption]
5. The Formal Problem
• Object space
  • $O = \{object_1, object_2, \dots\}$
  • Often infinite
• Representations of the objects in a (real-valued) feature space
  • $\mathcal{F} = \{\phi(o), o \in O\} = \{(x_1, \dots, x_m) \in \mathbb{R}^m\} = X$
  • "Independent" variables
• Dependent variable
  • $f^*(o) = y \in \mathbb{R}$
• A regression function
  • $f: \mathbb{R}^m \to \mathbb{R}$
• Regression
  • Finding an approximation $f$ of $f^*$
  • Relationship between the dependent and independent variables
[Figure: a function $f$ mapping the independent variable $x$ to the dependent variable $y$]
6. Quality of Regressions
• Goal: Approximation of the dependent variable
  • $f^*(o) \approx f(\phi(o))$
• Use test data
  • Structure is the same as training data
  • Apply the approximated regression function
| Top Speed | Engine Size | Horse Power | Weight | Year | value $f^*(o)$ | prediction $f(\phi(o))$ |
|-----------|-------------|-------------|--------|------|----------------|--------------------------|
| 250 | 1.4 | 130 | 1254 | 2003 | 7.8 | 7.5 |
| 280 | 1.8 | 185 | 1430 | 2010 | 6.3 | 6.9 |
| … | … | … | … | … | … | … |

The first five columns are the features $\phi(o)$.

How do you evaluate $f^*(o) \approx f(\phi(o))$?
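A sketch of this workflow in Python with scikit-learn: the regression function is approximated on training data and then applied to structurally identical test data. The feature columns mirror the table above, but all values are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Columns mirror the table: top speed, engine size, horse power, weight, year.
X = rng.uniform([150, 1.0, 80, 900, 2000], [300, 3.0, 300, 2000, 2020], (200, 5))
y = 20.0 - 0.03 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LinearRegression().fit(X_train, y_train)   # approximate f on training data
print(model.predict(X_test[:2]))                   # f(phi(o)) for two unseen cars
print(y_test[:2])                                  # f*(o), the actual values
```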
7. Visual Comparison
[Figure: predicted vs. actual values; the blue line marks perfect prediction]
• Deviation from the blue line is a visual indicator of prediction quality
• Good prediction: data close to the blue line, regular pattern of deviations
• Allows insights into where predictions are good/bad
8. Residuals
• Differences between predictions and actual values
  • $e_x = y - f(x)$
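A minimal sketch of computing residuals, reusing the actual and predicted values from the table above:

```python
import numpy as np

# Actual values and predictions from the evaluation table above.
y_true = np.array([7.8, 6.3])
y_pred = np.array([7.5, 6.9])

residuals = y_true - y_pred   # e_x = y - f(x)
print(residuals)              # [ 0.3 -0.6]
```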
9. Visual Comparison of a Bad Fit
10. Measures for Regression Quality
• Mean Absolute Error (MAE)
  • $MAE = \frac{1}{|X|} \sum_{x \in X} |e_x|$
• Mean Squared Error (MSE)
  • $MSE = \frac{1}{|X|} \sum_{x \in X} e_x^2$
• R squared ($R^2$)
  • Fraction of the variance that is explained by the regression
  • $R^2 = 1 - \frac{\sum_{x \in X} (y - f(x))^2}{\sum_{x \in X} (y - mean(y))^2}$
• Adjusted R squared ($\bar{R}^2$)
  • Takes the number of features into account
  • $\bar{R}^2 = 1 - (1 - R^2) \frac{|X| - 1}{|X| - m - 1}$
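A sketch of computing these measures in Python: MAE, MSE, and R² are available in scikit-learn, while adjusted R² has no built-in helper and is derived from the formula above. The data and the assumed feature count m are made up:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([7.8, 6.3, 5.1, 9.0])   # actual values (first two from the table)
y_pred = np.array([7.5, 6.9, 5.0, 8.6])   # predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Adjusted R^2 derived from R^2, with n = |X| instances and m features
# (m = 2 is an assumption for this toy example).
n, m = len(y_true), 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - m - 1)
print(f"MAE={mae:.3f}, MSE={mse:.3f}, R2={r2:.3f}, adj. R2={r2_adj:.3f}")
```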
11. Outline
• Overview
• Linear Regression Models
• Comparison of Regression Models
• Summary
12. Linear Regression
• Regression as a linear function
  • $y = b_0 + b_1 x_1 + \dots + b_m x_m$
  • $b_0$ is the intercept (the interception with the y-axis)
  • $b_1, \dots, b_m$ are the linear coefficients
• Calculated with Ordinary Least Squares (OLS)
  • Optimizes the MSE!
  • $\min_b \|b_0 + Xb - y\|_2^2$ (the square of the Euclidean distance)
  • $X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,m} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,m} \end{pmatrix}$, where $n$ is the number of instances in the training data
  • $b = (b_1, \dots, b_m)$
  • $y = (y_1, \dots, y_n)$
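A minimal sketch of ordinary least squares with scikit-learn's LinearRegression, which minimizes exactly this squared error. The synthetic data and true coefficients are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))    # n=100 instances, m=3 features
y = 1.0 + X @ np.array([0.5, -2.0, 3.0]) + rng.normal(0.0, 0.1, 100)

ols = LinearRegression().fit(X, y)   # fits b by minimizing the squared error
print(ols.intercept_)                # estimate of b0, close to 1.0
print(ols.coef_)                     # estimates of b1..b3, close to 0.5, -2.0, 3.0
```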
13. Ridge Regression
• Still a linear function
• OLS allows multiple solutions when $m > n$, i.e., when there are more features than instances
• Ridge regression penalizes solutions with large coefficients
• Calculated with Tikhonov regularization
  • $\min_b \|b_0 + Xb - y\|_2^2 + \|\Gamma b\|_2^2$, where $\|\Gamma b\|_2^2$ is the regularization term
  • We use $\Gamma = \alpha I$, with $I$ the identity matrix
  • Use $\alpha$ to regulate the regularization strength
  • $\min_b \|b_0 + Xb - y\|_2^2 + \alpha \|b\|_2^2$
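A sketch of ridge regression with scikit-learn, whose alpha parameter corresponds to the regularization strength $\alpha$ above. The data and the alpha value are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([0.5, -2.0, 3.0]) + rng.normal(0.0, 0.1, 100)

# alpha is the regularization strength (Gamma = alpha * I above);
# larger alpha shrinks the coefficients more strongly towards zero.
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)
```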
14. Lasso Regression
• Still a linear function
• Penalizing large coefficients does not remove redundancies
  • Extreme example: identical features that predict perfectly, $y = x_1 = x_2$
  • Ridge: $b_1 = b_2 = 0.5$
  • One coefficient of zero would be better: $b_1 = 1$, $b_2 = 0$
• Lasso: Ridge with the Manhattan norm
  • $\min_b \|b_0 + Xb - y\|_2^2 + \alpha \|b\|_1$
• Increases the likelihood of coefficients being exactly zero
  • Selects relevant features
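A sketch reproducing the slide's extreme example with scikit-learn: for two identical features, Ridge splits the weight roughly evenly, while Lasso tends to set one coefficient to exactly zero. The alpha values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
x = rng.normal(size=100)
X = np.column_stack([x, x])   # x1 and x2 are identical features
y = x                         # y = x1 = x2

print(Ridge(alpha=1.0).fit(X, y).coef_)    # roughly [0.5, 0.5]
print(Lasso(alpha=0.01).fit(X, y).coef_)   # roughly [1.0, 0.0]: one feature selected
```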
15. Elastic Net Regression
• Still a linear function
• Lasso tends to select one of multiple correlated features at random
  • Potential loss of information
• Elastic Net combines Ridge and Lasso
  • Keeps only relevant correlated features and minimizes coefficients
• Use the ratio $\rho$ between the alphas to assign more weight to Ridge/Lasso
  • $\min_b \|b_0 + Xb - y\|_2^2 + \rho \cdot \alpha \|b\|_1 + \frac{1 - \rho}{2} \alpha \|b\|_2^2$
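A sketch of elastic net with scikit-learn, where l1_ratio plays the role of $\rho$; scikit-learn's internal scaling of the objective differs slightly from the formula above, so this is an illustration rather than an exact match. The data and hyperparameters are invented:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
x = rng.normal(size=100)
# Two highly correlated features plus one irrelevant feature.
X = np.column_stack([x, x + rng.normal(0.0, 0.01, 100), rng.normal(size=100)])
y = x

# l1_ratio plays the role of rho: 1.0 is a pure Lasso penalty,
# 0.0 a pure Ridge penalty, values in between mix the two.
enet = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X, y)
print(enet.coef_)   # weight shared across the correlated features
```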
16. Outline
• Overview
• Linear Regression Models
• Comparison of Regression Models
• Summary
17. Comparison of Regression Models
[Figure: coefficients of the OLS, Ridge, Lasso, and Elastic Net models. All models were trained with the same data and achieve almost the same performance. Annotations: Ridge has smaller coefficient values than OLS; Elastic Net has smaller coefficient values than Lasso; with Lasso, the coefficient with the previously highest value is actually eliminated!]
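A sketch of how such a comparison could be produced: all four models are trained on the same synthetic data and their coefficients are printed side by side. The data and hyperparameters are invented:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = 1.0 + X @ np.array([3.0, 0.0, -1.5, 0.5, 0.0]) + rng.normal(0.0, 0.5, 200)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    print(f"{name:>11}: {np.round(model.fit(X, y).coef_, 2)}")
```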
18. Non-linear Regression
• Many relationships are not linear
• Polynomial Regression
• Support Vector Regression
• Neural Networks
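As a small illustration of the first approach, a sketch of polynomial regression with scikit-learn: a polynomial feature expansion turns a linear model into a non-linear one. The quadratic toy data is invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + rng.normal(0.0, 0.2, 100)  # quadratic relationship

# Expand the feature space with polynomial terms, then fit a linear model on it.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(x, y)
print(poly.predict([[2.0]]))   # close to 0.5*4 - 2 = 0
```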
19. Outline
• Overview
• Linear Regression Models
• Comparison of Regression Models
• Summary
20. Summary
• Regression finds relationships between independent and dependent variables
• Linear regression is a simple model that is often effective
• Regularization can improve solutions
  • Lasso, Ridge, Elastic Net, …
• Many non-linear approaches exist
  • They require care in their application
  • Overfitting can happen very easily