2. Regression Introduced
- Regression is about prediction: predicting an unknown point based on observations (or measurements).
- Example: widgets sold based on advertising.
- We can explore known relationships.
- We can explore unknown relationships.
3. The Variables
- Outcome variable: the thing we are predicting (e.g. number of widgets sold).
- Predictor variable (simple regression): the variable that you know about (e.g. advertising dollars).
- Predictor variables (multiple regression): more than one known variable.
- In short: we predict values of a dependent variable (outcome) using one or more independent variables (predictors).
4. The Model
- Any prediction follows the basic formula: outcome_i = (model) + error_i.
- In regression our model contains two things:
  - the slope of the line that best fits the measured data: b1
  - the intercept of the line at the Y axis: b0
- So our model is: Y_i = (b0 + b1 * X_i) + error_i.
- Do you recognize this equation? The model is simply a line.
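As a concrete instance of outcome_i = (model) + error_i, here is a minimal Python sketch; the coefficients and the data point are made-up numbers for illustration, not from any real dataset:

```python
# Hypothetical slope and intercept (made-up numbers for illustration).
b0 = 0.05   # intercept: predicted outcome when the predictor is 0
b1 = 1.99   # slope: change in outcome per unit change in the predictor

x_i = 4.0   # predictor value for observation i (e.g. advertising dollars)
y_i = 7.8   # observed outcome for observation i (e.g. widgets sold)

model = b0 + b1 * x_i   # the line's prediction for this observation
error = y_i - model     # the residual, so that outcome_i = model + error_i
```

By construction the observed outcome decomposes exactly into the model's prediction plus the leftover error.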
5. So How Do We Calculate This Line?
- The method of least squares: find the line that is closest to all the data points.
- Residuals (deviations): the distances from the actual data points to the line.
- Square these residuals to get rid of negatives, then sum them; the best-fit line is the one that minimizes this sum.
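The least-squares fit above can be sketched in a few lines of Python. The data are made-up advertising/sales numbers; the slope formula used is the standard closed-form solution that minimizes the summed squared residuals:

```python
# Made-up data: advertising spend (X) and widgets sold (Y).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Least-squares slope and intercept (closed-form solution).
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
      / sum((x - mean_x) ** 2 for x in X))
b0 = mean_y - b1 * mean_x

# Residuals: actual point minus the line; square to remove negatives, then sum.
ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
```

Any other slope/intercept pair would give a larger `ss_residual` on the same data.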
6. How Well Does This Line Fit?
- No line is perfect (there are always residuals).
- If our line is a good one, it should be significantly better than a baseline line.
- We compare our line to a baseline using the same measure: deviation = SUM (observed - model)^2.
- The simplest baseline model is the mean.
- The mean is an awful predictor: no matter how much you spend on adverts, it predicts the same widget sales.
7. Fitness Continued
- SSt = total sum of squared differences (using the mean as the model).
- SSr = residual sum of squares (using our best-fit model); represents the degree of inaccuracy that remains.
- SSm (model sum of squares) = SSt - SSr.
- A large SSm means our model is a big improvement over the simple (mean) model.
- Proportion of improvement: R^2 = SSm / SSt, the proportion of variation in the outcome that can be explained by our model (multiply by 100 for a percentage).
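The SSt/SSr/SSm decomposition can be sketched as follows, again on made-up advertising/sales numbers (the fit is recomputed inline so the sketch is self-contained):

```python
# Made-up data: advertising spend (X) and widgets sold (Y).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
      / sum((x - mean_x) ** 2 for x in X))
b0 = mean_y - b1 * mean_x

ss_t = sum((y - mean_y) ** 2 for y in Y)                    # SSt: mean as the model
ss_r = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))  # SSr: best-fit line
ss_m = ss_t - ss_r                                          # SSm: the improvement
r_squared = ss_m / ss_t  # proportion of outcome variation explained by the model
```

An `r_squared` near 1 means the line explains almost all the variation the mean leaves unexplained.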
8. More Fitness
- You can also assess the fit with an F-test.
- F is simply systematic variance / unsystematic variance.
- In regression that means: the improvement due to the model (SSm, systematic) versus the difference between the model and the observed data (SSr, unsystematic).
- But an F-test needs average sums of squares (mean squares), so we divide each by its degrees of freedom:
  - For SSm: the number of predictors in the model.
  - For SSr: the number of observations minus the number of parameters being estimated (the beta coefficients, including the intercept).
- F = MSm / MSr.
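Putting the degrees of freedom and mean squares together, a self-contained Python sketch of the F ratio on the same made-up data (one predictor, so df for SSr is n minus the two estimated parameters, b0 and b1):

```python
# Made-up data: advertising spend (X) and widgets sold (Y).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
      / sum((x - mean_x) ** 2 for x in X))
b0 = mean_y - b1 * mean_x

ss_t = sum((y - mean_y) ** 2 for y in Y)
ss_r = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
ss_m = ss_t - ss_r

k = 1                # number of predictors in the model
df_m = k             # degrees of freedom for SSm
df_r = n - (k + 1)   # observations minus estimated parameters (b0 and b1)

ms_m = ss_m / df_m   # mean squares: average sums of squares
ms_r = ss_r / df_r
F = ms_m / ms_r      # systematic variance / unsystematic variance
```

A large F means the model's improvement dwarfs the leftover residual variance.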
9. Individual Predictors
- The coefficient b is essentially the gradient of the line.
- If the predictor is not valuable, then it will predict no change in the outcome as it changes; this would be b = 0.
- This is exactly what the mean does.
- If the predictor is valuable, then b will be significantly different from 0.
10. Individual Predictors cont.
- To test whether b is different from 0 we use a t-test.
- We compare how big the b value is relative to the amount of error in that estimate: the standard error of the b value.
- t = (b_observed - b_expected) / SE_b.
- Since the expected value is 0 (no change), we simply divide the observed b value by its standard error to get the t score.
- Degrees of freedom: N - p - 1 (p = number of predictors).
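The t-test on the slope can be sketched on the same made-up data. One assumption in this sketch: for simple regression the standard error of the slope is taken as sqrt(MSr / Sxx), the usual closed-form expression, where MSr is the residual mean square and Sxx the sum of squared predictor deviations:

```python
import math

# Made-up data: advertising spend (X) and widgets sold (Y).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
sxx = sum((x - mean_x) ** 2 for x in X)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / sxx
b0 = mean_y - b1 * mean_x
ss_r = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))

p = 1                                  # number of predictors
df = n - p - 1                         # degrees of freedom: N - p - 1
se_b1 = math.sqrt((ss_r / df) / sxx)   # standard error of the slope
t = (b1 - 0) / se_b1                   # expected b under the null is 0
```

A t value far from 0 (judged against a t distribution with `df` degrees of freedom) means the slope is significantly different from 0.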