Inference for simple linear regression (Ch. 7.3-7.4)
$R^2$ statistic (Ch. 8.6.2)
Association is not causation (Ch. 7.5.3)
Next class: Diagnostics for assumptions of simple linear regression model (Ch. 8.2-8.3)
Goal of regression: Estimate the mean response of Y for subpopulations X=x, $\mu\{Y|X=x\}$
Example: Y= catheter length required, X=height
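Simple linear regression model: $\mu\{Y|X\} = \beta_0 + \beta_1 X$
Estimate $\beta_0$ and $\beta_1$ by least squares: choose $\hat\beta_0$ and $\hat\beta_1$ to minimize the sum of squared residuals (prediction errors), $\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$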
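As a minimal sketch of the least squares step, the Python below uses the closed-form solution for the slope and intercept. The function name `fit_line` and the five data points are hypothetical illustrations; they are not values from carprices.JMP.

```python
def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared x deviations, and sum of cross-products of x and y deviations.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope estimate
    b0 = y_bar - b1 * x_bar  # intercept estimate
    return b0, b1


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(x, y)

# Residuals (prediction errors) from the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1, sum(r ** 2 for r in residuals))
```

Least squares residuals always sum to zero when the model includes an intercept, which is a quick sanity check on any fit.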
Car Price Example
A used-car dealer wants to understand how odometer reading affects the selling price of used cars.
The dealer randomly selects 100 three-year old Ford Tauruses that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player and air conditioning.
carprices.JMP contains the price and number of miles on the odometer of each car.
Inference for Simple Linear Regression
Inference assumes that the ideal simple linear regression model holds.
Inference is based on taking repeated random samples ($y_1, \ldots, y_n$) from the same subpopulations ($x_1, \ldots, x_n$) as in the observed data.
Types of inference:
Hypothesis tests for intercept and slope
Confidence intervals for intercept and slope
Confidence interval for mean of Y at $X = X_0$
Prediction interval for future Y for which $X = X_0$
Ideal Simple Linear Regression Model
Assumptions of ideal simple linear regression model
There is a normally distributed subpopulation of responses for each value of the explanatory variable
The means of the subpopulations fall on a straight-line function of the explanatory variable.
The subpopulation standard deviations are all equal (to $\sigma$).
The selection of an observation from any of the subpopulations is independent of the selection of any other observation.
Sampling Distributions of $\hat\beta_0$ and $\hat\beta_1$
See Display 7.7
Standard deviation is smaller for (i) larger n, (ii) smaller $\sigma$, (iii) larger spread in x (higher $s_X^2$)
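The idea of repeated sampling from the same subpopulations can be sketched with a small simulation. The code below holds the x values fixed, draws many response samples from an ideal model, and compares the standard deviation of the simulated slope estimates with the standard result $SD(\hat\beta_1) = \sigma / \sqrt{(n-1)s_X^2}$. The parameter values (intercept 1, slope 2, $\sigma = 3$, x = 1, ..., 20) and function names are assumptions chosen only for illustration.

```python
import math
import random


def slope(x, y):
    """Least squares slope estimate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx


def simulate_slope_sd(x, beta0, beta1, sigma, reps=2000, seed=0):
    """SD of slope estimates over repeated samples at the same x values."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        # Each response comes from a normal subpopulation with mean
        # beta0 + beta1 * x and standard deviation sigma.
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        slopes.append(slope(x, y))
    mean = sum(slopes) / reps
    return math.sqrt(sum((s - mean) ** 2 for s in slopes) / (reps - 1))


def theoretical_slope_sd(x, sigma):
    """sigma / sqrt((n - 1) * s_x^2), where (n - 1) * s_x^2 is the sum of squared x deviations."""
    x_bar = sum(x) / len(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sigma / math.sqrt(sxx)


x = [float(i) for i in range(1, 21)]  # hypothetical x values
sim_sd = simulate_slope_sd(x, beta0=1.0, beta1=2.0, sigma=3.0)
theory_sd = theoretical_slope_sd(x, sigma=3.0)
print(sim_sd, theory_sd)
```

Rerunning with a smaller `sigma`, more x values, or more widely spread x values should shrink both numbers, matching points (i) to (iii).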
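Hypothesis tests for $\beta_0$ and $\beta_1$
Hypothesis test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
Based on t-test statistic, $t = \hat\beta_1 / SE(\hat\beta_1)$, which has a t-distribution with n-2 degrees of freedom under the null hypothesis
The p-value has its usual interpretation: the probability, under the null hypothesis, that |t| would be at least as large as its observed value. A small p-value is evidence against the null hypothesis.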
Interpretation of null hypothesis: X is not a useful predictor of Y, mean of Y is not associated with X.
Test for $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$ is based on an analogous test statistic.
Test statistics and p-values can be found on JMP output under parameter estimates, obtained by using fit line after fit Y by X.
For car price data, convincing evidence that both intercept and slope are not zero (p-value <.0001 for both).
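For readers working outside JMP, the t-statistics can be sketched directly from the standard formulas: $\hat\sigma = \sqrt{RSS/(n-2)}$, $SE(\hat\beta_1) = \hat\sigma/\sqrt{(n-1)s_X^2}$ and $SE(\hat\beta_0) = \hat\sigma\sqrt{1/n + \bar{X}^2/((n-1)s_X^2)}$. The function name and the toy data are hypothetical; p-values are not computed here because the Python standard library has no t-distribution.

```python
import math


def coefficient_tests(x, y):
    """Estimates, standard errors, and t-statistics for H0: coefficient = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_x^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Estimate sigma from the residual sum of squares with n - 2 degrees of freedom.
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))
    se_b1 = sigma_hat / math.sqrt(sxx)
    se_b0 = sigma_hat * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    return {
        "b0": b0, "se_b0": se_b0, "t_b0": b0 / se_b0,
        "b1": b1, "se_b1": se_b1, "t_b1": b1 / se_b1,
    }


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = coefficient_tests(x, y)
print(result["t_b0"], result["t_b1"])
```

In this toy example the slope t-statistic is far from zero while the intercept t-statistic is small, so only the slope would show strong evidence against its null hypothesis.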
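Confidence Intervals for $\beta_0$ and $\beta_1$
Confidence intervals provide a range of plausible values for $\beta_0$ and $\beta_1$
95% Confidence Intervals: $\hat\beta_0 \pm t_{n-2}(.975)\, SE(\hat\beta_0)$ and $\hat\beta_1 \pm t_{n-2}(.975)\, SE(\hat\beta_1)$
Table A.2 lists $t_{n-2}(.975)$. It is approximately 2.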
Finding CIs in JMP: Go to parameter estimates, right click, click Columns and then click Lower 95% and Upper 95%.
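The interval arithmetic itself is short enough to sketch by hand. The helper `t_interval` and the slope estimate and standard error below are hypothetical; the multiplier 2.0 is the rough approximation mentioned above, and an exact multiplier would come from a t table with n-2 degrees of freedom.

```python
def t_interval(estimate, se, t_crit):
    """Interval of the form: estimate +/- t_crit * standard error."""
    margin = t_crit * se
    return estimate - margin, estimate + margin


# Hypothetical slope estimate and standard error, for illustration only.
slope_hat = 1.99
se_slope = 0.06

# t_crit = 2.0 is the "approximately 2" rule of thumb, not an exact table value.
lower, upper = t_interval(slope_hat, se_slope, t_crit=2.0)
print(lower, upper)
```

An interval that excludes 0, as this one does, points the same way as a small p-value in the test of a zero slope.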
For car price data set, CIs:
Two prediction problems
(a) The used-car dealer has an opportunity to bid on a lot of cars offered by a rental company. The rental company has 250 Ford Tauruses, all equipped with automatic transmission, air conditioning and AM/FM cassette tape players. All of the cars in this lot have about 40,000 miles on the odometer. The dealer would like an estimate of the average selling price of all cars in this lot (or, virtually equivalently, the average selling price of the population of Ford Tauruses with the above equipment and 40,000 miles on the odometer).
(b) The used-car dealer is about to bid on a 3-year old Ford Taurus equipped with automatic transmission, air conditioner and AM/FM cassette tape player and with 40,000 miles on the odometer. The dealer would like to predict the selling price of this particular car.
Prediction problem (a)
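Goal is to estimate the conditional mean of selling price given odometer reading=40,000, $\mu\{Y|X=40{,}000\}$
Point estimate is $\hat\mu\{Y|X=40{,}000\} = \hat\beta_0 + \hat\beta_1 \cdot 40{,}000$
What is a range of plausible values for $\mu\{Y|X=40{,}000\}$?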
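Confidence Intervals for Mean of Y at $X = X_0$
What is a plausible range of values for $\mu\{Y|X_0\}$?
95% CI for $\mu\{Y|X_0\}$: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975)\, SE[\hat\mu\{Y|X_0\}]$, where $\hat\mu\{Y|X_0\} = \hat\beta_0 + \hat\beta_1 X_0$ and $SE[\hat\mu\{Y|X_0\}] = \hat\sigma \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}$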
Note about formula
Precision in estimating $\mu\{Y|X_0\}$ is not constant for all values of $X_0$. Precision decreases as $X_0$ gets farther away from the sample average of the X's.
JMP implementation: Use Confid Curves Fit command under red triangle next to Linear Fit after using Fit Y by X, fit line. Use the crosshair tool to find the exact values of the confidence interval endpoints for a given $X_0$.
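The loss of precision away from the sample average can be seen in a minimal sketch of the standard formula for a confidence interval for the mean response. The function name and toy data are hypothetical; `T_CRIT = 3.182` is the 0.975 quantile of the t-distribution with n-2 = 3 degrees of freedom, taken from a t table for this five-point example.

```python
import math

T_CRIT = 3.182  # t table value for 3 degrees of freedom (n = 5 in the toy data)


def mean_ci(x, y, x0, t_crit):
    """Point estimate and 95% CI for the mean of Y at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_x^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))
    mu_hat = b0 + b1 * x0
    # The standard error grows with the distance of x0 from the sample average.
    se_mu = sigma_hat * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)
    return mu_hat, mu_hat - t_crit * se_mu, mu_hat + t_crit * se_mu


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

est_center, lo_center, hi_center = mean_ci(x, y, x0=3.0, t_crit=T_CRIT)  # x0 at the sample average
est_far, lo_far, hi_far = mean_ci(x, y, x0=5.0, t_crit=T_CRIT)           # x0 at the edge of the data
print(hi_center - lo_center, hi_far - lo_far)
```

The interval at the edge of the data is wider than the one at the sample average, even though both use the same fitted line.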
Prediction Problem (b)
Goal is to predict the selling price of a given car with odometer reading=40,000.
What are likely values for a future value $Y_0$ at some specified value of X ($= X_0$)?
Best prediction is the estimated mean response for $X = X_0$: $\hat\mu\{Y|X_0\} = \hat\beta_0 + \hat\beta_1 X_0$
A prediction interval is an interval of likely values along with a measure of the likelihood that interval will contain response.
95% prediction interval at $X_0$: If repeated samples are obtained from the subpopulations and a prediction interval is formed each time, the prediction interval will contain the value $Y_0$ of a future observation from the subpopulation at $X_0$ 95% of the time.
Prediction Intervals Cont.
Prediction interval must account for two sources of uncertainty:
Uncertainty about the location of the subpopulation mean
Uncertainty about where the future value will be in relation to its mean
Prediction Error = Random Sampling Error + Estimation Error
Prediction Interval Formula
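95% prediction interval at $X_0$: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975) \sqrt{\hat\sigma^2 + SE[\hat\mu\{Y|X_0\}]^2}$
Compare to 95% CI for mean at $X_0$: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975)\, SE[\hat\mu\{Y|X_0\}]$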
Prediction interval is wider due to random sampling error in future response
As sample size n becomes large, margin of error of CI for mean goes to zero but margin of error of PI doesn’t.
JMP implementation: Use Confid Curves Indiv command under red triangle next to Linear Fit after using Fit Y by X, fit line. Use the crosshair tool to find the exact values of the prediction interval endpoints for a given $X_0$.
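The different large-sample behavior of the two intervals can be sketched numerically. The code below evaluates both margins of error at $X_0 = \bar{X}$ for growing n, under loudly stated assumptions: $\hat\sigma$ is held at a hypothetical value of 1.0, the multiplier is the rough value 2.0, and the sum of squared x deviations is set to n-1 (that is, $s_X^2 = 1$).

```python
import math


def margins(sigma_hat, n, x0, x_bar, sxx, t_crit=2.0):
    """Margins of error for the CI for the mean and for the prediction interval."""
    se_mean = sigma_hat * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)
    # The prediction interval adds the variability of the future response itself.
    se_pred = math.sqrt(sigma_hat ** 2 + se_mean ** 2)
    return t_crit * se_mean, t_crit * se_pred


sample_sizes = [10, 100, 10000]
ci_margins = []
pi_margins = []
for n in sample_sizes:
    # Hypothetical setup: x0 at the sample average, s_x^2 = 1, sigma_hat = 1.0.
    ci_m, pi_m = margins(sigma_hat=1.0, n=n, x0=0.0, x_bar=0.0, sxx=float(n - 1))
    ci_margins.append(ci_m)
    pi_margins.append(pi_m)

for n, ci_m, pi_m in zip(sample_sizes, ci_margins, pi_margins):
    print(n, round(ci_m, 4), round(pi_m, 4))
```

The confidence interval margin shrinks toward zero as n grows, while the prediction interval margin levels off near `t_crit * sigma_hat`.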
The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable.
Unitless measure of strength of relationship between x and y
Total sum of squares = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. Best sum of squared prediction error without using x.
Residual sum of squares = $\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$
$R^2$ = .6501. Read as “65.01 percent of the variation in car prices was explained by the linear regression on odometer.”
Interpreting $R^2$
$R^2$ takes on values between 0 and 1, with higher $R^2$ indicating a stronger linear association.
If the residuals are all zero (a perfect fit), then $R^2$ is 1. If the least squares line has slope 0, $R^2$ will be 0.
$R^2$ is useful as a unitless summary of the strength of linear association.
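The definition and the two boundary cases above can be checked with a short sketch that computes $R^2$ as (total sum of squares - residual sum of squares) / total sum of squares. The function name and all three toy data sets are hypothetical.

```python
def r_squared(x, y):
    """R^2 = (total SS - residual SS) / total SS for a least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Squared prediction error without using x (predict every y by the mean).
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    # Squared prediction error using the fitted line.
    residual_ss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return (total_ss - residual_ss) / total_ss


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r2_noisy = r_squared(x, [2.1, 3.9, 6.2, 7.8, 10.1])    # close to a line
r2_perfect = r_squared(x, [3.0, 5.0, 7.0, 9.0, 11.0])  # exactly y = 1 + 2x
r2_flat = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])  # least squares slope is 0
print(r2_noisy, r2_perfect, r2_flat)
```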
Caveats about $R^2$
$R^2$ is not useful for assessing model adequacy, i.e., whether the simple linear regression model holds (use residual plots), or whether or not there is an association (use the test of $H_0: \beta_1 = 0$).
A good $R^2$ depends on the context. In precise laboratory work, $R^2$ values under 90% might be too low, but in social science contexts, when a single variable rarely explains a great deal of variation in response, $R^2$ values of 50% may be considered remarkably good.
Association is not causation
A high $R^2$ means that x has a strong linear relationship with y: there is a strong association between x and y. It does not imply that x causes y.
Alternative explanations for high $R^2$:
Reverse is true. Y causes X.
There may be a lurking (confounding) variable related to both x and y which is the common cause of x and y
No cause-and-effect relationship can be inferred unless X is randomly assigned to units in a randomized experiment.
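The lurking-variable explanation can be illustrated with a small simulation. In the sketch below, a hypothetical variable z drives both x and y, and y never depends on x in the data-generating code, yet the regression of y on x still has a high $R^2$. All names, noise levels, and the seed are assumptions made for illustration; the code uses the fact that, in simple linear regression, $R^2$ equals the squared sample correlation.

```python
import random


def r_squared(x, y):
    """R^2 for simple linear regression, computed as the squared sample correlation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


rng = random.Random(1)
n = 500
z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # lurking variable (hypothetical)
x = [zi + rng.gauss(0.0, 0.3) for zi in z]   # x depends on z only
y = [zi + rng.gauss(0.0, 0.3) for zi in z]   # y depends on z only, never on x

r2_xy = r_squared(x, y)

# Remove the common cause from both variables: the association disappears.
x_given_z = [xi - zi for xi, zi in zip(x, z)]
y_given_z = [yi - zi for yi, zi in zip(y, z)]
r2_after_removing_z = r_squared(x_given_z, y_given_z)
print(r2_xy, r2_after_removing_z)
```

Changing x in this setup would do nothing to y, even though x predicts y well in observational data.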
A researcher measures the number of television sets per person X and the average life expectancy Y for the world’s nations. The regression line has a positive slope – nations with many TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them TV sets?
A community in the Philadelphia area is interested in how crime rates affect property values. If low crime rates increase property values, the community may be able to cover the costs of increased police protection by gains in tax revenues from higher property values. Data on the average housing price and crime rate (per 1000 population) for communities in Pennsylvania near Philadelphia for 1996 are shown in housecrime.JMP.
Can you deduce a cause-and-effect relationship from these data? Besides high crime rates causing low housing prices, what other explanations are there for the association between housing prices and crime rate?
Does the ideal simple linear regression model appear to hold?