# HRUG - Linear regression with R

A talk presented to the Houston R User Group on the basics of linear regression in R.

Published in: Data & Analytics

### HRUG - Linear regression with R

1. Linear Regression with R. Ed Goodwin, Houston R Users Group
2. Recap from the last meetup
    • statistical learning vs. machine learning
    • supervised vs. unsupervised learning
    • categorical models vs. quantitative models
3. Linear Regression is…
    • statistical learning
    • supervised learning
    • a quantitative model
4. A Simple Dataset
5. What’s the best model for this data? …a straight line, aka a linear model…
6. What’s the best fit for the line?
7. The line that minimizes the residual error, that is, the sum of the squared vertical distances from the points to the line, which is why we refer to the regression line as the least squares regression line.
8. A linear regression finds the y-intercept and the slope of the line that minimizes the squared residuals. Use the lm function in R to create linear models. A regression on one predictor variable is known as a simple linear regression.
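
As a minimal sketch of the lm call described on this slide, the example below fits a simple linear regression to simulated data. The data frame `toy`, its columns `x` and `y`, and the true intercept and slope are all made up for illustration; they are not the dataset shown in the talk.

```r
# Simulated data: the true intercept (2) and slope (0.5) are arbitrary choices.
set.seed(42)
x <- runif(50, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
toy <- data.frame(x = x, y = y)

# Fit y as a linear function of x by ordinary least squares.
fit <- lm(y ~ x, data = toy)

# Estimated y-intercept and slope of the fitted line.
coef(fit)

# Residuals are the vertical distances from each point to the fitted line;
# least squares chooses the line that makes this sum of squares as small as possible.
rss <- sum(residuals(fit)^2)
rss
```

Calling `summary(fit)` on the same object prints the coefficient table, the residual standard error, and R² in one place.
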
9. Assumptions of Linear Models
10. Assumptions:
    • the relationship between the predictors and the predicted variable is linear
    • the variance of the error terms is constant (homoskedastic)
    • minimal to no outliers in the data (unusually high or low y for a given x)
    • minimal to no leverage points in the data (unusually high or low x relative to the rest of the data)
    • no collinearity among predictor variables
    • the effects of the predictors are additive (no interaction effects)
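
Several of these assumptions can be checked numerically with functions from base R. The sketch below uses simulated data (the columns `x1`, `x2`, `y` are hypothetical), and the cutoffs of 3 for standardized residuals and 2p/n for leverage are common rules of thumb, not thresholds taken from the talk.

```r
# Simulated data with two independent predictors (illustrative only).
set.seed(1)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2, data = dat)

# Outliers: standardized residuals far from zero.
outlier_idx <- which(abs(rstandard(fit)) > 3)

# Leverage: hat values well above the average p / n,
# where p is the number of estimated coefficients.
p <- length(coef(fit))
leverage_idx <- which(hatvalues(fit) > 2 * p / n)

# Collinearity: a pairwise correlation near +1 or -1 is a warning sign.
predictor_cor <- cor(dat$x1, dat$x2)

# Linearity and constant variance are usually judged by eye from the
# residuals-vs-fitted plot, which plot(fit, which = 1) draws.
```
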
11. Linear Model Fitting
12. Data Analysis: Bonds dataset from “A Modern Approach to Regression with R” (Sheather, Simon. 2009). https://link.springer.com/book/10.1007%2F978-0-387-09608-7 Consider the following dataset of bond prices: `bonds.dat = read.csv("http://www.stat.tamu.edu/~sheather/book/docs/datasets/bonds.txt", sep='\t')`
13. Data Analysis: Data analysis is the art of asking questions of the data and searching for answers. What questions should we ask?
    • Is Bid Price a function of Coupon Rate or vice versa?
    • What type of relationship does Bid Price appear to have with Coupon Rate?
    • Is there a formula we could use to predict Bid Price based on Coupon Rate?
14. Linear Model of Bond Prices as a function of Coupon Rates
    • Is this line a good or bad fit with the data?
    • Why or why not?
    • Can we improve the model?
    • How?
15. Outliers: Know your data and know how your models work!
    • Why would outliers skew the linear regression model?
    • What should we do about it?
    • Why do these outliers exist?
16. What are the outliers? A flower bond is a U.S. Treasury bond that can be redeemed before maturity if it is used to settle federal estate taxes. When flower bonds are surrendered in payment of taxes, and accepted as such, that constitutes payment of those taxes for statute of limitations and statutory interest purposes.
17. Adjusted Bond Model
    • After removing the outliers, the model looks much better
    • But how do you know that it’s a better model?
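
The refit-without-outliers step can be sketched as follows. To stay runnable offline, this uses simulated data rather than the real bonds file: the data frame `bonds_sim`, the column names `CouponRate` and `BidPrice`, the row positions in `flower`, and every numeric value are assumptions chosen to mimic a few low-coupon bonds priced well above the trend.

```r
# Simulated stand-in for the bond data (not the real bonds.txt values).
set.seed(7)
n <- 35
coupon <- runif(n, min = 5, max = 13)
flower <- c(4, 13, 35)               # hypothetical row positions of the outliers
coupon[flower] <- c(3.0, 3.5, 4.0)   # low coupon rates...
price <- 75 + 3 * coupon + rnorm(n, sd = 1.5)
price[flower] <- price[flower] + 12  # ...but prices well above the trend
bonds_sim <- data.frame(CouponRate = coupon, BidPrice = price)

# Model on all rows, then on the rows that remain after dropping the outliers.
fit_all <- lm(BidPrice ~ CouponRate, data = bonds_sim)
fit_trim <- lm(BidPrice ~ CouponRate, data = bonds_sim[-flower, ])

# Compare slope, residual standard error, and R-squared side by side.
comparison <- rbind(
  all = c(slope = coef(fit_all)[["CouponRate"]],
          rse = summary(fit_all)$sigma,
          r2 = summary(fit_all)$r.squared),
  trimmed = c(slope = coef(fit_trim)[["CouponRate"]],
              rse = summary(fit_trim)$sigma,
              r2 = summary(fit_trim)$r.squared)
)
comparison
```

Dropping rows is only defensible when there is a substantive reason the points follow a different process, as with the flower bonds here.
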
18. Evaluating Models
19. Evaluating the two bond models: Key measures are the p-value, Residual Standard Error (RSE), and R²
20. Residual Sum of Squares (RSS): “In statistics, the residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors of prediction (SSE), is the sum of the squares of residuals (deviations predicted from actual empirical values of data).” Source: https://en.wikipedia.org/wiki/Residual_sum_of_squares
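
To make the link between RSS, RSE, and R² concrete, the sketch below computes all three by hand and compares them with what R reports. It uses the built-in `cars` dataset purely as a convenient example; that choice is an assumption and not part of the talk.

```r
# Simple linear regression on the built-in cars data (illustration only).
fit <- lm(dist ~ speed, data = cars)

# RSS: sum of squared residuals.
rss <- sum(residuals(fit)^2)

# RSE: square root of RSS divided by the residual degrees of freedom,
# n - p - 1, where p is the number of predictors.
n <- nrow(cars)
p <- 1
rse <- sqrt(rss / (n - p - 1))

# R-squared: share of the total variation in the response explained by the model.
tss <- sum((cars$dist - mean(cars$dist))^2)
r2 <- 1 - rss / tss

# These should match the values stored in the fitted model and its summary.
c(rss = rss, rse = rse, r2 = r2)
c(rss = deviance(fit), rse = summary(fit)$sigma, r2 = summary(fit)$r.squared)
```
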
21. Multiple Regression Models
22. If Simple Regression Models depict a linear relationship between two variables, what do you think Multiple Regression Models do?
23. Multiple Regression Models describe the relationship between a scalar response variable and two or more predictor variables
24. Why not just run several simple linear regressions?
25. Advertising Data Set: Sales based on multiple types and levels of advertising spend (TV, Radio, Newspaper)
26. First, let’s look at the data
27. Are the model assumptions intact?
28. Multiple Regression uses the lm function as well. Simply modify the formula by adding more variables.
29. Analyze the Model: Why is the Newspaper coefficient so low?
30. What if we removed Newspaper from the model?
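
A sketch of the multiple regression formula and of dropping one predictor is shown below. The data frame `ads` is simulated: its column names (`TV`, `Radio`, `Newspaper`, `Sales`), ranges, and coefficients are assumptions standing in for the Advertising data, and `Newspaper` is given no true effect on `Sales` by construction.

```r
# Simulated stand-in for the Advertising data (values are made up).
set.seed(123)
n <- 200
ads <- data.frame(
  TV = runif(n, 0, 300),
  Radio = runif(n, 0, 50),
  Newspaper = runif(n, 0, 100)
)
ads$Sales <- 3 + 0.045 * ads$TV + 0.19 * ads$Radio + rnorm(n, sd = 1.7)

# Multiple regression: add predictors to the right-hand side with "+".
fit_multi <- lm(Sales ~ TV + Radio + Newspaper, data = ads)
summary(fit_multi)$coefficients

# Refit without Newspaper and compare the two nested models with an F-test.
fit_reduced <- update(fit_multi, . ~ . - Newspaper)
anova(fit_reduced, fit_multi)
```

A large p-value in the `anova` table suggests the extra predictor adds little beyond what the smaller model already explains.
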
31. How do we model Interaction Effects? Modify the regression formula to include interactions between the predictor variables.
32. Does this interaction improve the model?
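
In R's formula syntax, `TV * Radio` expands to `TV + Radio + TV:Radio`, where `TV:Radio` is the interaction term. The sketch below compares an additive model with an interaction model on simulated data; the data frame `ads`, its columns, and the built-in interaction effect are assumptions for illustration, not the Advertising data itself.

```r
# Simulated data in which TV and Radio spend reinforce each other.
set.seed(321)
n <- 200
ads <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50))
ads$Sales <- 6 + 0.02 * ads$TV + 0.03 * ads$Radio +
  0.001 * ads$TV * ads$Radio + rnorm(n, sd = 1)

# Additive model vs. model with an interaction term.
fit_add <- lm(Sales ~ TV + Radio, data = ads)
fit_int <- lm(Sales ~ TV * Radio, data = ads)  # TV + Radio + TV:Radio

names(coef(fit_int))

# Nested-model F-test and adjusted R-squared for both fits.
int_test <- anova(fit_add, fit_int)
int_test
c(additive = summary(fit_add)$adj.r.squared,
  interaction = summary(fit_int)$adj.r.squared)
```
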
33. What about reintroducing Newspaper with interaction?
34. Model with the Newspaper and TV interaction included
35. What is the best model? The most accurate model on the training data is always the model with the most predictor variables (p) and the lowest residual sum of squares (RSS)…but what is the best model?
36. The best model is… the simplest model with the most predictive power on the entire data population while staying within your resource constraints.
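
The point about training RSS can be sketched by adding predictors that are pure noise. Everything here is simulated and hypothetical (the data frame `d`, the `noise1` to `noise10` columns, the 70/30 split). Training RSS cannot go up when predictors are added to a least squares fit, while error on held-out rows often, though not always, gets worse.

```r
# One real predictor plus ten columns of pure noise.
set.seed(99)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
noise <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
names(noise) <- paste0("noise", 1:10)
d <- cbind(d, noise)

# Hold out the last 30 rows as a stand-in for unseen data.
train <- d[1:70, ]
test <- d[71:100, ]

fit_small <- lm(y ~ x, data = train)
fit_big <- lm(y ~ ., data = train)   # x plus all ten noise columns

# Training RSS: the bigger model is never worse here.
rss_train <- c(small = deviance(fit_small), big = deviance(fit_big))

# Mean squared error on the held-out rows.
mse_test <- c(
  small = mean((test$y - predict(fit_small, newdata = test))^2),
  big = mean((test$y - predict(fit_big, newdata = test))^2)
)

rss_train
mse_test
```
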
37. Bonus: Easy and tidy with the broom package. From the broom vignette: https://cran.r-project.org/web/packages/broom/vignettes/broom.html The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames. This package provides three S3 methods that do three distinct kinds of tidying.
    • tidy: constructs a data frame that summarizes the model's statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.
    • augment: add columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments.
    • glance: construct a concise one-row summary of the model. This typically contains values such as R², adjusted R², and residual standard error that are computed once for the entire model.
38. broom package using our Advertising data
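
A sketch of the three broom verbs applied to an lm fit follows. It assumes the broom package is installed (the code skips the calls otherwise) and uses a simulated data frame `ads` with hypothetical columns in place of the Advertising data.

```r
# Simulated stand-in for the Advertising data (values are made up).
set.seed(123)
n <- 200
ads <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50))
ads$Sales <- 3 + 0.045 * ads$TV + 0.19 * ads$Radio + rnorm(n, sd = 1.7)
fit <- lm(Sales ~ TV + Radio, data = ads)

# Only run the broom calls when the package is available.
if (requireNamespace("broom", quietly = TRUE)) {
  coef_table <- broom::tidy(fit)    # one row per term: estimate, std.error, statistic, p.value
  model_row <- broom::glance(fit)   # one row per model: r.squared, adj.r.squared, sigma, ...
  per_obs <- broom::augment(fit)    # one row per observation, with .fitted and .resid added
  print(coef_table)
  print(model_row)
  print(head(per_obs))
}
```

Because each result is a data frame, the outputs from several models can be stacked and compared with ordinary data frame tools.
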