Regression Modelling Overview

  1. Why regression. Today's Lecture: Overview of Regression (1: Motivating Examples)
  2. Wine Quality. In "Super Crunchers", Ian Ayres gives a formula for wine quality as:
       Wine quality = 12.145 + 0.00117 × winter rainfall + 0.0614 × average growing season temperature − 0.00386 × harvest rainfall
     What is this formula telling us?
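     A quick way to see what the formula does (a supplementary sketch, not from the slides) is to evaluate it in R for a single vintage; the rainfall and temperature figures below are invented purely for illustration.
       # Ayres's wine-quality formula, evaluated for one hypothetical vintage
       winter_rainfall  <- 600   # mm, illustrative value only
       growing_temp     <- 17    # average growing-season temperature (deg C), illustrative
       harvest_rainfall <- 120   # mm, illustrative value only
       quality <- 12.145 + 0.00117 * winter_rainfall +
                  0.0614 * growing_temp - 0.00386 * harvest_rainfall
       quality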
  3. Motivating examples
       1 The relationship between restaurant characteristics and location, and the prices charged
       2 The relationship between wine price and critic ratings
       3 The relationship between various "risk factors" and the occurrence of heart disease
       4 One more example I haven't decided on yet (and which therefore isn't in the notes yet; suggestions welcome!)
       5 Later we will examine the relationship between various characteristics of golfers and the money they earn
     Along the way, we will consider "smaller" datasets to illustrate specific points.
  4. Motivating examples 1: New York Restaurants
       1 This is intended to give an example of where we want to be by the end of Term 1, so you have an idea of what we are learning and why.
       2 In other words, relax, sit back and get a feel for what you will be able to do after we've spent a whole term studying it.
  5. Zagat Price Guide Example (Manhattan Restaurant Pricing)
       1 Sheather [2009] suggests you have been retained to advise a chef on menu pricing for a new Italian restaurant in Manhattan.
       2 He provides data from "Zagat Survey 2001: New York City Restaurants, Zagat, New York".
       3 Given a model, you can predict the effect of various restaurant characteristics on the kind of price you can charge.
       4 Specifically, we could try to decide whether you can charge more for a restaurant that is "East" of the river. This is denoted by an "indicator" variable which takes the value 1 for a restaurant East of the river, and 0 otherwise.
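     If the raw data held location as text rather than as 0/1, the indicator could be built by hand; a minimal sketch assuming a hypothetical character vector called location (the nyc.csv file used on the next slide already supplies East coded as 0/1).
       # Hypothetical: recode a text location into a 0/1 indicator variable
       location <- c("East", "West", "East")   # illustrative values only
       east <- ifelse(location == "East", 1, 0)
       east                                    # 1 0 1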
  6. Loading the data
       > nyc.df <- read.csv("data/nyc.csv")
       > summary(nyc.df)
       Case:       Min.  1.00   1st Qu. 42.75   Median 84.50   Mean 84.50   3rd Qu. 126.25   Max. 168.00
       Restaurant: Amarone: 1   Anche Vivolo: 1   Andiamo: 1   Arno: 1   Artusi: 1   Baci: 1   (Other): 162
       Price:      Min. 19.0    1st Qu. 36.0    Median 43.0    Mean 42.7    3rd Qu.  50.0    Max.  65.0
       Food:       Min. 16.0    1st Qu. 19.0    Median 20.5    Mean 20.6    3rd Qu.  22.0    Max.  25.0
       Decor:      Min.  6.00   1st Qu. 16.00   Median 18.00   Mean 17.69   3rd Qu.  19.00   Max.  25.00
       Service:    Min. 14.0    1st Qu. 18.0    Median 20.0    Mean 19.4    3rd Qu.  21.0    Max.  24.0
       East:       Min. 0.000   1st Qu. 0.000   Median 1.000   Mean 0.631   3rd Qu.  1.000   Max.  1.000
       > pairs(nyc.df[, c(3:6)], main = "Pairs plot for Zagat price data")
  7. The data. [Figure: pairs plot for the Zagat price data, showing pairwise scatterplots of Price, Food, Decor and Service.]
  8. Loading the data. One of our predictor variables is not continuous. It is a qualitative (categorical / nominal / factor) variable, which we use as an indicator (dummy) variable equal to 1 if the restaurant is East of the river, and 0 otherwise. We can best examine the relationship between this variable and Price by means of a boxplot.
       > boxplot(Price ~ East, data = nyc.df, col = "orange", main = "Effect of East on price", ylab = "Price", xlab = "East")
  9. The boxplot. [Figure: boxplots of Price for East = 0 and East = 1, titled "Effect of East on price".]
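     As a small numerical supplement to the boxplot (not part of the original slides), the average price in each group can be computed directly from the same data frame.
       # Mean Price for restaurants west (East = 0) and east (East = 1) of the river
       tapply(nyc.df$Price, nyc.df$East, mean)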
  10. Fitting a model
       > nyc.lm1 <- lm(Price ~ Food + Decor + Service + East, data = nyc.df)
       > summary(nyc.lm1)
       Call: lm(formula = Price ~ Food + Decor + Service + East, data = nyc.df)
       Residuals:  Min -14.0465   1Q -3.8837   Median 0.0373   3Q 3.3942   Max 17.7491
       Coefficients:
                      Estimate  Std. Error  t value  Pr(>|t|)
       (Intercept)  -24.023800    4.708359   -5.102  9.24e-07
       Food           1.538120    0.368951    4.169  4.96e-05
       Decor          1.910087    0.217005    8.802  1.87e-15
       Service       -0.002727    0.396232   -0.007    0.9945
       East           2.068050    0.946739    2.184    0.0304
       Residual standard error: 5.738 on 163 degrees of freedom
       Multiple R-squared: 0.6279,  Adjusted R-squared: 0.6187
       F-statistic: 68.76 on 4 and 163 DF,  p-value: < 2.2e-16
  11. The model
       Price = −24.02 + 1.54 × Food + 1.91 × Decor + 0.00 × Service + 2.07 × East + ε
     We can see that the higher the value of Food, the higher the price charged: for every unit increase in Food, the predicted Price increases by 1.54. There doesn't appear to be a relationship between Service and Price. This is rather interesting, as it implies you could employ Basil Fawlty to look after all the diners.
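     To see what the model predicts for a particular set of ratings, the fitted model can be applied to a new restaurant with predict(); the ratings below are invented for illustration and are not from the slides.
       # Predicted price for a hypothetical restaurant east of the river
       new_rest <- data.frame(Food = 22, Decor = 18, Service = 20, East = 1)
       predict(nyc.lm1, newdata = new_rest)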
  12. What to do about Service. The most important thing to note for now is that these values are only estimates! We will study inference more formally later; for now we shall use a simple rule of thumb.
       Key Point (Rule of two): if the absolute value of an estimate divided by its standard error is less than 2, we can't even be sure what sign the estimate should have.
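     The ratio used in the rule of two can be read straight from the coefficient table of the model already fitted; a short sketch:
       # |estimate / standard error| for each term in nyc.lm1
       coefs <- coef(summary(nyc.lm1))
       abs(coefs[, "Estimate"] / coefs[, "Std. Error"])
       # only Service has a ratio below 2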
  13. Modifying the model. Some people might remove this variable from our model.
       > nyc.lm2 <- update(nyc.lm1, Price ~ . - Service)
       > summary(nyc.lm2)
       Call: lm(formula = Price ~ Food + Decor + East, data = nyc.df)
       Residuals:  Min -14.0451   1Q -3.8809   Median 0.0389   3Q 3.3918   Max 17.7557
       Coefficients:
                      Estimate  Std. Error  t value  Pr(>|t|)
       (Intercept)  -24.0269      4.6727     -5.142  7.67e-07
       Food           1.5363      0.2632      5.838  2.76e-08
       Decor          1.9094      0.1900     10.049   < 2e-16
       East           2.0670      0.9318      2.218    0.0279
       Residual standard error: 5.72 on 164 degrees of freedom
       Multiple R-squared: 0.6279,  Adjusted R-squared: 0.6211
       F-statistic: 92.24 on 3 and 164 DF,  p-value: < 2.2e-16
  14. Look at the adjusted R²
       Key Point (Adjusted R²): the R² is a diagnostic (taking values between 0 and 1) which tells us the proportion of variation in Y explained by our model. The adjusted R² incorporates a penalty to account for the number of variables we have used.
     What do you notice about the adjusted R-squared?
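     The two adjusted R-squared values can be extracted and compared directly from the fitted models:
       # Adjusted R-squared with and without Service
       summary(nyc.lm1)$adj.r.squared   # model including Service
       summary(nyc.lm2)$adj.r.squared   # model excluding Service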
  15. We could also compare the two models:
       Model 1: Price = −24.02 + 1.54 × Food + 1.91 × Decor + 0.00 × Service + 2.07 × East + ε
       Model 2: Price = −24.03 + 1.54 × Food + 1.91 × Decor + 2.07 × East + ε
     So it looks as if we could indeed suggest prices to charge, based on ratings of the other variables.
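     A more formal way to compare the two nested models (using inference we have not covered yet) is an F-test via anova(); a sketch, for reference:
       # F-test of the model without Service against the model with Service
       anova(nyc.lm2, nyc.lm1)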
  16. Residual checking. As well as fitting models, we have to make sure they are sensible. This involves checking all the assumptions we made when fitting the model. Again, we haven't said anything about the assumptions yet, but just to introduce the ideas let's check two:
       1 Check that the residuals are Normally distributed
       2 Check that the residuals have constant variance
  17. Checking the Normality of the residuals. One method we could use is to examine a histogram of the residuals:
       > hist(resid(nyc.lm2), freq = FALSE)
       > curve(dnorm(x, 0, summary(nyc.lm2)$sigma), add = TRUE, col = "red")
  18. Figure: Histogram of residuals from the model fit, with the Normal density superimposed.
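     An alternative Normality check, not shown in the slides, is a Normal quantile-quantile plot of the residuals:
       # Q-Q plot of the residuals with a reference line
       qqnorm(resid(nyc.lm2))
       qqline(resid(nyc.lm2), col = "red")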
  19. Checking the constant variance assumption. We will use some kind of plot of residuals against the fitted values (the points on the regression line corresponding to individuals in the dataset). We start with a simple plot of the fitted values against the residuals:
       > plot(fitted(nyc.lm2) ~ resid(nyc.lm2))
  20. Figure: Plot of fitted values versus residuals for nyc.lm2.
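     R's built-in diagnostics offer a convenient alternative to the hand-rolled plot above; calling plot() on the fitted lm object with which = 1 draws residuals against fitted values with a smoother added (a supplement to the slide, not part of it).
       # Standard residuals-versus-fitted diagnostic for nyc.lm2
       plot(nyc.lm2, which = 1)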
  21. Assumption checking. We could conclude that (a) the residuals appear Normal and (b) the variance appears constant. As we are happy with our model, we can answer the most substantive question: having a restaurant East of the river seems to add $2.07 to the price you can charge for a meal.
  22. Summary of what we have done
       1 We have specified a problem and collected some data
       2 We have carried out an exploratory data analysis
       3 We have fitted an appropriate model to the data
       4 We have checked the assumptions made when fitting that model
       5 We have made some adjustments to the model
       6 We have attempted to draw some conclusions
  23. What we need to do
       1 Think about the kinds of problems we can examine by regression modelling
       2 Learn (revise and extend what you did in STAT1401) how to carry out an exploratory data analysis
       3 Learn about the types of models we can build, and the assumptions we make when building them
       4 Learn more about how to check the model assumptions, and understand some of the problems when they are not met
       5 Learn more about how to alter the structure of a model, in particular how to decide in observational studies which variables to include and exclude
       6 Learn how to interpret the results of model fitting and, when appropriate, how to carry out statistical inference on the results
  24. How we can assess this
       1 Ask you to discuss the reasons for a particular study and how we deal with the different variables (exam)
       2 Ask you to carry out an EDA (coursework), or comment on an EDA that has been carried out (exam)
       3 Fit an appropriate model for a particular dataset (coursework); discuss and explain the principles behind various models (exam)
       4 Carry out residual checks and make adjustments to a model (coursework); comment on residual checks and explain why adjustments have been made (exam)
       5 Carry out and report a model-building exercise (coursework); explain someone else's model building (exam)
       6 Interpret the results of your own (coursework) or someone else's model fitting (exam)
  25. References
       R.D. Cook and S. Weisberg. Applied Regression Including Computing and Graphics. John Wiley, Hoboken, NJ, 1999.
       Simon Sheather. A Modern Approach to Regression with R. Springer Texts in Statistics. Springer-Verlag, New York, 2009.