Chapt 11 & 12 linear & multiple regression minitab



  1. 3/22/2010 IE 609 Chapter 11: Simple Linear Regression and Correlation

     The Relation between Two Sets of Measures
     • Construct a scatter diagram for the following data.
     • Plot the results.
     • You might have reversed the axes so that the vertical dimension represented the midterm grade and the horizontal dimension the final grade.
     • When one measure may be used to predict another, it is customary to represent the predictor on the horizontal dimension (the x-axis).
     • Linear or straight-line relationship
  2. The Relation between Two Sets of Measures
     • Other relationships
     • Which of the diagrams represents the stronger relationship?

     Simple Linear Regression
     • y = α + βx
     • yi = a + bxi + εi
     • Minitab data entry: Table 11.1, pg 393
     • Calc -> Column Statistics
  3. Simple Linear Regression
     • Calc -> Calculator (create formula; store result in variable: Residual)
     • Graph -> Probability Plot
     • Residuals appear normally distributed.

     Linear Regression: Simple Structure
     • yi = ŷ + εi
     • ŷ → sample mean = 34.0606 (Minitab: mean of Demand, y (%) = 34.0606)
     • εi = yi - ŷ (Minitab "Residual")
     • Sample variance of y = (10.7)²

     Linear Regression and Correlation: Simple Structure
     Question: Is the sample mean of Demand the correct value to use for ŷ?
     • yi = ŷ + εi
     • Although it might seem to be a trivial question, you might ask why the sample mean (ȳ) was the correct value to use for ŷ.
     • Since the purpose of the model is to accurately describe the yi, we would expect the model to deliver small errors (that is, small εi). But how should we go about making the errors small?
  4. Linear Regression: Simple Structure
     Question: Is the sample mean of Demand the correct value to use for ŷ?
     • yi = ŷ + εi
     • A logical choice is to pick ŷ, which might be different from the sample mean, so that the error variance s² calculated with εi = yi - ŷ is minimized:
       s² = Σεi² / (n - 1), summed over i = 1..n
     • Σεi² = Σ(yi - ŷ)²
     • The calculus operation that delivers this solution is d(Σεi²)/dŷ = 0, whose solution is ŷ = ȳ.
     • This is called the method of least squares because the method minimizes the error sum of squares.

     Linear Regression: Simple Structure
     • Now consider the scatter diagram: y appears to increase linearly with respect to x.
     • There might be an underlying causal relationship between x and y of the form y = α + βx.
     • The parameters α and β are the y-axis intercept and slope, respectively.
     • Since we typically have sample data and not the complete population of (x, y) observations, we cannot expect to determine α and β exactly; they will have to be estimated from the sample data.
     • Our model is of the form yi = a + bxi + εi.
     • Then for any choice of a and b the εi may be determined from εi = yi - ŷi = yi - (a + bxi).
     • These errors or discrepancies εi are also called the model residuals.
     • Although this equation allows us to calculate the εi for a given (xi, yi) data set once a and b are specified, there are still an infinite number of a and b values that could be used in the model. Clearly the choice of a and b that provides the best fit to the data should make the εi, or some function of them, small.
     • Although many conditions can be stated to define best-fit lines by minimizing the εi, by far the most frequently used condition is the one that minimizes Σεi².
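The question of whether the sample mean is the right constant to use for ŷ can be checked numerically. The sketch below uses made-up demand values (not the Table 11.1 data) and a hypothetical helper `sse`, and simply compares the error sum of squares at the mean with nearby candidates.

```python
# Hypothetical demand values, chosen only for illustration.
y = [28.0, 31.5, 34.0, 36.5, 40.0]


def sse(y_hat, values):
    """Error sum of squares for the constant model y_i = y_hat + e_i."""
    return sum((v - y_hat) ** 2 for v in values)


y_bar = sum(y) / len(y)

# Candidate constants on either side of the sample mean.
candidates = [y_bar - 1.0, y_bar - 0.1, y_bar, y_bar + 0.1, y_bar + 1.0]
best = min(candidates, key=lambda c: sse(c, y))

print(f"sample mean = {y_bar:.2f}, candidate with smallest SSE = {best:.2f}")
```

Any shift away from the mean by d adds n·d² to the error sum of squares, which is why the mean wins among the candidates.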
  5. Linear Regression and Correlation: Simple Structure
     • The best-fit line for the (x, y) data is called the linear least squares regression line. It corresponds to the choice of a and b that minimizes Σεi².
     • The calculus solution to this problem is given by the simultaneous solution to the two equations:
       ∂(Σεi²)/∂a = 0 and ∂(Σεi²)/∂b = 0, with the sums taken over i = 1..n
     • The method of fitting a line to (xi, yi) data using this solution is called linear regression.

     Linear Regression: Simple Structure
     • The error variance for linear least squares regression is given by
       sε² = Σεi² / (n - 2)
       where n is the number of (xi, yi) observations and sε is called the standard error of the model.
     • The equation has n - 2 in the denominator because two degrees of freedom are consumed by the calculation of the regression coefficients a and b from the experimental data.
     • Think of the error variance sε² in the regression problem in the same way as you think of the sample variance s² used to quantify the amount of variation in simple measurement data.
     • Whereas the sample variance characterizes the scatter of observations about a single value ŷ = ȳ, the error variance in the regression problem characterizes the distribution of values about the line ŷi = a + bxi.
     • sε² and s² are close cousins: they are both measures of the errors associated with different models for different kinds of data.

     REGRESSION COEFFICIENTS: Determining the unique values of a and b
     • With the condition that determines the a and b values that provide the best-fit line for the (xi, yi) data, namely the minimization of Σεi², we proceed to determine a and b in a more rigorous manner.
     • The calculus method that determines the unique values of a and b that minimize Σεi² requires that we solve the simultaneous equations:
       ∂(Σεi²)/∂a = 0 and ∂(Σεi²)/∂b = 0
  6. REGRESSION COEFFICIENTS
     • From these equations the resulting values of a and b are best expressed in terms of sums of squares:
       SSx = Σ(xi - x̄)²
       SSy = Σ(yi - ȳ)²
       SSxy = Σ(xi - x̄)(yi - ȳ)
       b = SSxy / SSx
       a = ȳ - b·x̄
       with all sums taken over i = 1..n.
     • SSx and SSy are just the sums of squares required to determine the variances of the x and y values.
     • Similarly, using the sum of squares notation, we can write the error sum of squares for the regression as
       SSε = Σεi² = SSy - SSxy² / SSx
       and the standard error as
       sε = √(SSε / (n - 2))
     • Another important implication of the equations a = ȳ - b·x̄ and b = SSxy / SSx is that the point (x̄, ȳ) falls on the best-fit line. This is just a consequence of the way the sums of squares are calculated.
     • Stat > Regression > Fitted Line Plot reports the fitted line ŷ = a + bx, with a = ȳ - b·x̄ and s² = SSE / (n - 2).
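The sums-of-squares recipe for a and b translates almost line for line into code. The following is a minimal sketch on hypothetical (x, y) values (not data from the slides); the variable names are illustrative only.

```python
from math import sqrt

# Hypothetical (x, y) observations, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares as defined on the slides.
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares slope and intercept.
b = ss_xy / ss_x
a = y_bar - b * x_bar

# Residuals, error sum of squares, and standard error with n - 2 df.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
ss_err = sum(e ** 2 for e in residuals)
s_err = sqrt(ss_err / (n - 2))

print(f"a = {a:.3f}, b = {b:.3f}, SSE = {ss_err:.3f}, s = {s_err:.3f}")
```

Evaluating the fitted line at x̄ returns ȳ, which is the "point (x̄, ȳ) falls on the best-fit line" property in numeric form.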
  7. LINEAR REGRESSION ASSUMPTIONS
     • A valid linear regression model requires that five conditions are satisfied:
       1. The values of x are determined without error.
       2. The εi are normally distributed with mean με = 0 for all values of x.
       3. The distribution of the εi has constant variance σε² for all values of x within the range of experimentation (that is, homoscedasticity).
       4. The εi are independent of each other.
       5. The linear model provides a good fit to the data.

     HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS
     • The values of the intercept and slope, a and b, found with the equations b = SSxy / SSx and a = ȳ - b·x̄ are actually estimates of the true parameters α and β.
     • Figure: hypothetical distributions for α and β. Both of these distributions follow Student's t distribution with degrees of freedom equal to the error degrees of freedom.
     • Although linear regression analysis will always return a and b values, it is possible that one or both of these values could be statistically insignificant. We require a formal method of testing α and β to see if they are different from zero. The hypotheses for these tests are:
       H0: α = 0 versus H1: α ≠ 0
       H0: β = 0 versus H1: β ≠ 0
     • To perform these tests we need some idea of the amount of variability present in the estimates of α and β.
  8. HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS
     • Estimates of the variances σα² and σβ² are given by:
       sa² = sε² (1/n + x̄² / SSx) and sb² = sε² / SSx
     • The hypothesis tests can be performed using one-sample t tests with dfε = n - 2 degrees of freedom, with the t statistics:
       ta = a / sa and tb = b / sb
     • The (1 - α)100% confidence intervals for α and β are determined from
       P(a - tα/2·sa < α < a + tα/2·sa) = 1 - α
       P(b - tα/2·sb < β < b + tα/2·sb) = 1 - α
       with n - 2 degrees of freedom.
     • It is very important to realize that the standard deviations of a and b as given are proportional to the standard error of the fit sε. This means that if there are any uncontrolled variables in the experiment that cause the standard error to increase, there will be a corresponding increase in the standard deviations of the regression coefficients. This could make the regression coefficients disappear into the noise.
     • Always keep in mind that the model's ability to predict the regression coefficients depends on the size of the standard error. Take care to remove, control, or account for extraneous variation so that you get the best predictions from your models with the least effort.

     CONFIDENCE LIMITS FOR THE REGRESSION LINE
     • The true slope and intercept of a regression line are not exactly known.
     • The (1 - α)100% confidence interval for the regression line is given by:
       ŷ ± tα/2·sε·√(1/n + (x - x̄)² / SSx)
     • Stat > Regression > Fitted Line Plot menu. You will have to select Display Confidence Bands in the Options menu to add the confidence limits to the fitted line plot.
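The t tests on the intercept and slope can be sketched as follows. This is an illustration under stated assumptions: the data are hypothetical, the variance estimates use the standard textbook forms sa² = sε²(1/n + x̄²/SSx) and sb² = sε²/SSx, and the critical value 3.182 is the tabulated two-sided 5% point of Student's t with 3 degrees of freedom rather than something computed here.

```python
from math import sqrt

# Hypothetical (x, y) observations, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = ss_xy / ss_x
a = y_bar - b * x_bar

ss_err = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_err = sqrt(ss_err / (n - 2))  # standard error, n - 2 degrees of freedom

# Standard deviations of the coefficient estimates (standard textbook forms).
s_a = s_err * sqrt(1.0 / n + x_bar ** 2 / ss_x)
s_b = s_err / sqrt(ss_x)

# One-sample t statistics for H0: alpha = 0 and H0: beta = 0.
t_a = a / s_a
t_b = b / s_b

# Tabulated two-sided 5% critical value of Student's t with n - 2 = 3 df.
T_CRIT = 3.182

ci_a = (a - T_CRIT * s_a, a + T_CRIT * s_a)
ci_b = (b - T_CRIT * s_b, b + T_CRIT * s_b)

print(f"t_a = {t_a:.2f}, significant: {abs(t_a) > T_CRIT}")
print(f"t_b = {t_b:.2f}, significant: {abs(t_b) > T_CRIT}")
print(f"95% CI for alpha: ({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"95% CI for beta:  ({ci_b[0]:.3f}, {ci_b[1]:.3f})")
```

For these made-up values the slope is clearly significant while the intercept is not, which mirrors the slide's point that regression always returns a and b even when one of them is indistinguishable from zero.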
  9. PREDICTION LIMITS FOR THE OBSERVED VALUES
     • The prediction interval provides prediction bounds for individual observations. The width of the prediction interval combines the uncertainty of the position of the true line, as described by the confidence interval, with the scatter of points about the line, as measured by the standard error:
       ŷ ± tα/2·sε·√(1 + 1/n + (x - x̄)² / SSx)
       where tα/2 has dfε = n - 2 degrees of freedom.
     • Stat > Regression > Fitted Line Plot menu. You will have to select Display Prediction Bands in the Options menu.

     CORRELATION: COEFFICIENT OF DETERMINATION (r²) and CORRELATION COEFFICIENT (r)

     CORRELATION: Coefficient of Determination r²
     • A comprehensive statistic is required to measure the fraction of the total variation in the response y that is explained by the regression model.
     • The total variation in y taken relative to ȳ is given by SSy = Σ(yi - ȳ)², but SSy is partitioned into two terms: one that accounts for the amount of variation explained by the straight-line model, given by SSregression, and another that accounts for the unexplained error variation, given by SSε:
       SSε = Σ(yi - ŷi)² = Σεi²
     • The three quantities are related by:
       SSy = SSregression + SSε
     • Consequently, the fraction of SSy explained by the model is:
       r² = SSregression / SSy = 1 - SSε / SSy
       where r² is called the coefficient of determination.

     CORRELATION COEFFICIENT (r)
     • The correlation coefficient r is given by the square root of the coefficient of determination r², with an appropriate plus or minus sign.
     • If two measures have a linear relationship, it is possible to describe how strong the relationship is by means of a statistic called a correlation coefficient.
     • The symbol for the correlation coefficient is r.
     • The symbol for the corresponding population parameter is ρ (the Greek letter "rho").
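The partition of SSy into an explained and an unexplained part can be verified directly. The sketch below again uses hypothetical data; `ss_regression` is computed from the fitted values so that the identity SSy = SSregression + SSε is checked rather than assumed.

```python
# Hypothetical (x, y) observations, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = ss_xy / ss_x
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

# Total, explained, and unexplained variation.
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
ss_err = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Coefficient of determination and signed correlation coefficient.
r2 = ss_regression / ss_y
r = (1.0 if b >= 0 else -1.0) * r2 ** 0.5

print(f"SSy = {ss_y:.3f} = {ss_regression:.3f} + {ss_err:.3f}")
print(f"r^2 = {r2:.4f}, r = {r:.4f}")
```

The sign of r is taken from the sign of the slope b, which is the "appropriate plus or minus sign" the slide refers to.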
  10. CORRELATION COEFFICIENT (r)
      • Pearson product-moment correlation. In the sums-of-squares notation, r = SSxy / √(SSx · SSy).

      PEARSON'S PRODUCT-MOMENT CORRELATION COEFFICIENT (r)
      • Given a set of data (Example 11.10, pg 435), find r.

            x       y        x       y
          0.414   29186    0.548   67095
          0.383   29266    0.581   85156
          0.399   26215    0.557   69571
          0.402   30162    0.55    84160
          0.442   38867    0.531   73466
          0.422   37831    0.55    78610
          0.466   44576    0.556   67657
          0.5     46097    0.523   74017
          0.514   59698    0.602   87291
          0.53    67705    0.569   86836
          0.569   66088    0.544   82540
          0.558   78486    0.557   81699
          0.577   89869    0.53    82096
          0.572   77369    0.547   75657
          0.548   67095    0.585   80490

      CORRELATION: The Coefficient of Determination r²
      • The coefficient of determination finds numerous applications in regression and multiple regression problems.
      • Since SSregression is bounded by 0 ≤ SSregression ≤ SSy, there are corresponding bounds on the coefficient of determination given by 0 ≤ r² ≤ 1.
      • When r² = 0 the regression model has little value because very little of the variation in y is attributable to its dependence on x. When r² = 1 the regression model almost completely explains all of the variation in the response, that is, x almost perfectly predicts y.
      • We're usually hoping for r² = 1, but this rarely happens.

      Confidence Interval for the Coefficient of Determination r²
      • The coefficient of determination r² is a statistic that represents the proportion of the total variation in the values of the variable Y that can be accounted for or explained by a linear relationship with the random variable X.
      • A different data set of (x, y) values will give a different value of r². The quantity that such r² values estimate is the true population coefficient of determination ρ², which is a parameter.

      Confidence Interval for the Correlation Coefficient (r)
      • When the distribution of the regression model residuals is normal with constant variance, the distribution of r is complicated, but the distribution of
        Z = ½ ln((1 + r) / (1 - r))
        is approximately normal with mean
        μZ = ½ ln((1 + ρ) / (1 - ρ))
        and standard deviation
        σZ = 1 / √(n - 3)
      • The transformation of r into Z is called Fisher's Z transformation.
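Fisher's Z transformation can be used to put an approximate interval around r. The sketch below is illustrative only: the (x, y) pairs are the Example 11.10 values as read from this transcript, and the pairing of x with y in rows that the extraction merged is an assumption. Transforming the interval for μZ back to the r scale with tanh is a standard step, not something stated on the slides.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

# Example 11.10 values as read from the transcript; the row pairing is an
# assumption where the extracted text had merged adjacent rows.
data = [
    (0.414, 29186), (0.383, 29266), (0.399, 26215), (0.402, 30162),
    (0.442, 38867), (0.422, 37831), (0.466, 44576), (0.5, 46097),
    (0.514, 59698), (0.53, 67705), (0.569, 66088), (0.558, 78486),
    (0.577, 89869), (0.572, 77369), (0.548, 67095), (0.548, 67095),
    (0.581, 85156), (0.557, 69571), (0.55, 84160), (0.531, 73466),
    (0.55, 78610), (0.556, 67657), (0.523, 74017), (0.602, 87291),
    (0.569, 86836), (0.544, 82540), (0.557, 81699), (0.53, 82096),
    (0.547, 75657), (0.585, 80490),
]
n = len(data)

x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n
ss_x = sum((x - x_bar) ** 2 for x, _ in data)
ss_y = sum((y - y_bar) ** 2 for _, y in data)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)

# Pearson product-moment correlation coefficient.
r = ss_xy / sqrt(ss_x * ss_y)

# Fisher's Z: atanh(r) equals 0.5 * ln((1 + r) / (1 - r)).
z = atanh(r)
sigma_z = 1.0 / sqrt(n - 3)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% normal critical value

# Interval for mu_Z, then mapped back to the correlation scale.
r_lo = tanh(z - z_crit * sigma_z)
r_hi = tanh(z + z_crit * sigma_z)

print(f"n = {n}, r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"approximate 95% interval for rho: ({r_lo:.3f}, {r_hi:.3f})")
```

The interval is not symmetric about r because the back-transformation is nonlinear, which is expected when r is close to 1.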
  11. Confidence Interval for the Correlation Coefficient (r)
      • This information can be used to construct a confidence interval for the unknown parameter μZ from the statistic r and the sample size n. The confidence interval is:
        P(Z - zα/2·σZ < μZ < Z + zα/2·σZ) = 1 - α

      LINEAR REGRESSION WITH MINITAB
      • MINITAB provides two basic functions for performing linear regression:
        1. The Stat > Regression > Fitted Line Plot menu is the best place to start to evaluate the quality of the fitted function. It includes a scatter plot of the (x, y) data with the superimposed fitted line, a full ANOVA table, and an abbreviated table of regression coefficients.
        2. The Stat > Regression > Regression menu. The first part of its output is a table of the regression coefficients and the corresponding standard deviations, t values, and p values. The second part is the ANOVA table, which summarizes the statistics required to determine the regression coefficients and the summary statistics like r, r², r²adj, and sε.
      • There is a p value reported for the slope of the regression line in the table of regression coefficients and another p value reported in the ANOVA table for the ANOVA F test. These two p values are numerically identical, and not just by coincidence: there is a special relationship between the t and F distributions when the F distribution has one numerator degree of freedom.

      POLYNOMIAL MODELS
      • The general form of a polynomial model is:
        ŷ = a + b1x + b2x² + … + bpx^p
        where the polynomial is said to be of order p.
      • The regression coefficients a, b1, …, bp are determined using the same algorithm that was used for the simple linear model: the error sum of squares is simultaneously minimized with respect to the regression coefficients.
      • The family of equations that must be solved to determine the regression coefficients is nightmarish, but most of the good statistical software packages have this capability.
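The remark about the slope's t test and the ANOVA F test sharing a p value in simple linear regression rests on the identity t² = F when F has one numerator degree of freedom. A small numeric check on hypothetical data (names are illustrative) might look like this:

```python
from math import sqrt

# Hypothetical (x, y) observations, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = ss_xy / ss_x
a = y_bar - b * x_bar
ss_err = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# t statistic for the slope, with n - 2 error degrees of freedom.
ms_err = ss_err / (n - 2)
t_b = b / sqrt(ms_err / ss_x)

# ANOVA F statistic: regression mean square (1 df) over error mean square.
ms_regression = (ss_y - ss_err) / 1
f_stat = ms_regression / ms_err

print(f"t_b^2 = {t_b ** 2:.3f}, F = {f_stat:.3f}")
```

Because the two statistics are the same number on matching reference distributions, the reported p values agree.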
  12. POLYNOMIAL MODELS
      • Although high-order polynomial models can fit the (x, y) data very well, they should be of the lowest order possible that accurately represents the relationship between y and x. There are no clear guidelines on what order might be necessary, but watch the significance (that is, the p values) of the various regression coefficients to confirm that all of the terms are contributing to the model.
      • Polynomial models must also be hierarchical, that is, a model of order p must contain all possible lower-order terms.
      • Because of their complexity, it is important to summarize the performance of polynomial models using r²adjusted instead of r². In some cases, when there are relatively few error degrees of freedom after fitting a large polynomial model, the r² value could be misleadingly large, whereas r²adjusted will be much lower but more representative of the true performance of the model.
      • Fit the following data with an appropriate model and use scatter plots and residuals diagnostic plots to check for lack of fit.
      • Solution: scatter plot
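To make the comparison of r² and r²adjusted concrete, the sketch below fits first- and second-order polynomials by solving the normal equations directly. Everything here is an assumption for illustration: the data are made up, the helper names `solve`, `fit_polynomial`, and `r2_and_adjusted` are hypothetical, and r²adjusted is taken in its usual form 1 - (SSε/dfε)/(SSy/(n - 1)). Statistical packages typically use more numerically robust algorithms than this.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


def fit_polynomial(x, y, order):
    """Least squares coefficients [a, b1, ..., bp] via the normal equations."""
    cols = [[xi ** k for xi in x] for k in range(order + 1)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)


def r2_and_adjusted(x, y, coefs):
    """Return (r2, adjusted r2) for a fitted polynomial."""
    n = len(y)
    y_bar = sum(y) / n
    fitted = [sum(c * xi ** k for k, c in enumerate(coefs)) for xi in x]
    ss_err = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    df_err = n - len(coefs)
    r2 = 1.0 - ss_err / ss_y
    r2_adj = 1.0 - (ss_err / df_err) / (ss_y / (n - 1))
    return r2, r2_adj


# Hypothetical curved data, chosen only for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 3.4, 7.1, 11.3, 17.1, 23.5, 30.9]

coefs_lin = fit_polynomial(x, y, 1)
coefs_quad = fit_polynomial(x, y, 2)
r2_lin, adj_lin = r2_and_adjusted(x, y, coefs_lin)
r2_quad, adj_quad = r2_and_adjusted(x, y, coefs_quad)

print(f"linear:    r2 = {r2_lin:.4f}, adjusted = {adj_lin:.4f}")
print(f"quadratic: r2 = {r2_quad:.4f}, adjusted = {adj_quad:.4f}")
```

The adjusted value is never larger than r², and the gap grows as the error degrees of freedom shrink, which is the reason the slides recommend it for higher-order models.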
  13. POLYNOMIAL MODELS
      • Solution: residuals diagnostic plots
      • Solution: quadratic model; create an x^2 column
  14. POLYNOMIAL MODELS
      • Solution: quadratic model. Stat > Regression > Fitted Line Plot (x, y), Quadratic

      Multiple Regression
      • When a response has n quantitative predictors, such as y(x1, x2, …, xn), the model for y must be created by multiple regression. In multiple regression each predictive term in the model has its own regression coefficient. The simplest multiple regression model contains a linear term for each predictor:
        ŷ = a + b1x1 + b2x2 + … + bnxn
      • This equation has the same basic structure as the polynomial model and, in fact, the two models are fitted and analyzed in much the same way. Where the worksheet to fit the polynomial model requires n columns, one for each power of x, the worksheet to fit the multiple regression model requires n columns to account for each of the n predictors. The same regression methods are used to analyze both problems.
      • Frequently, the simple linear model does not fit the data and a more complex model is required. The terms that must be added to the model to achieve a good fit might involve interactions, quadratic terms, or terms of even higher order. Such models have the basic form:
        ŷ = a + b1x1 + b2x2 + … + b12x1x2 + … + b11x1² + b22x2² + …
      • PROBLEM: A real-estate executive would like to be able to predict the cost of a house in a housing development on the basis of the number of bedrooms and bathrooms in the house.
      • Selling price table (in thousands of dollars)
  15. Multiple Regression
      • The following first-order model is assumed to connect the selling price of the home with the number of bedrooms and the number of baths. The dependent variable is represented by y and the independent variables are x1, the number of bedrooms, and x2, the number of baths:
        y = β0 + β1x1 + β2x2 + ε
      • MINITAB SOLUTION: Stat > Regression > Regression
      • MINITAB output: Stat > Regression > Regression

      Multiple Regression
      • Problem: blood pressure study on fifty middle-aged men.
      • The following table contains data from a blood pressure study. The data were collected on a group of middle-aged men. Systolic is the systolic blood pressure. Age is the age of the individual. Weight is the weight in pounds. Parents indicates whether the individual's parents had high blood pressure: 0 means neither parent has high blood pressure, 1 means one parent has high blood pressure, and 2 means both mother and father have high blood pressure. Med is the number of hours per month that the individual meditates. TypeA is a measure of the degree to which the individual exhibits type A personality behavior, as determined from a form that the person fills out.
      • Systolic is the dependent variable and the other five variables are the independent variables.
  16. Multiple Regression
      • Model: y = systolic, x1 = age, x2 = weight, x3 = parents, x4 = med, and x5 = Type A:
        y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε
      • MINITAB SOLUTION: Stat > Regression > Regression
      • The five hypothesis tests suggest Weight and Type A should be kept and the other three variables thrown out.

      Multiple Regression: Checking the Overall Utility of a Model
      • Purpose: check whether the model is useful while controlling your α value. Rather than conduct a large group of t tests on the betas and increase the probability of making a type I error, make one test and know that α = 0.05. The F test is such a test. It is contained in the analysis of variance associated with the regression analysis.
      • The F test tests the following hypothesis associated with the blood pressure model:
        H0: β1 = β2 = β3 = β4 = β5 = 0 versus H1: at least one βi ≠ 0
      • MINITAB SOLUTION: Stat > Regression > Regression
      • Interpretation: F = 18.50 with a p value of 0.000, so the null hypothesis should be rejected; the conclusion is that at least one βi ≠ 0. This F test says that the model is useful in predicting systolic blood pressure.
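The overall F test can be sketched for a two-predictor model. This is an illustration under stated assumptions, not the Minitab analysis from the slides: the bedroom, bath, and price values are invented, the helper `solve` is hypothetical, the F statistic uses the standard form (SSregression / p) / (SSε / (n - p - 1)), and 5.79 is the tabulated 5% critical value of F with 2 and 5 degrees of freedom.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


# Invented data: bedrooms, baths, selling price in thousands of dollars.
bedrooms = [2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 3.0, 4.0]
baths = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0]
price = [104.0, 127.0, 139.0, 162.0, 174.0, 197.0, 142.0, 158.0]

n = len(price)
p = 2  # number of predictors

# Design columns: constant term, x1, x2. Fit by the normal equations.
cols = [[1.0] * n, bedrooms, baths]
xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
xty = [sum(u * v for u, v in zip(ci, price)) for ci in cols]
b0, b1, b2 = solve(xtx, xty)

fitted = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(bedrooms, baths)]
y_bar = sum(price) / n
ss_y = sum((yi - y_bar) ** 2 for yi in price)
ss_err = sum((yi - fi) ** 2 for yi, fi in zip(price, fitted))
ss_regression = ss_y - ss_err

# Overall F statistic for H0: beta1 = beta2 = 0.
f_stat = (ss_regression / p) / (ss_err / (n - p - 1))

F_CRIT = 5.79  # tabulated 5% critical value of F with (2, 5) df

print(f"price = {b0:.1f} + {b1:.1f}*bedrooms + {b2:.1f}*baths")
print(f"F = {f_stat:.1f}, reject H0 at 5%: {f_stat > F_CRIT}")
```

A single F test of this kind keeps the overall type I error at the chosen level, which is the reason the slides prefer it to a battery of separate t tests as a first check.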