# Two variable linear model


## Chapter 1. The Two Variable Linear Model

### 1.1 The Basic Linear Model

The goal of this section is to build a simple model for the non-exact relationship between two variables $Y$ and $X$ that are related by some economic theory: for example, consumption and income, or quantity consumed and price. The proposed model is:

$$Y_i = \alpha + \beta X_i + u_i, \qquad i = 1, \ldots, n \tag{1.1}$$

where $\alpha$ and $\beta$ are unknown parameters, and estimating them is our purpose. What we will call 'data' are the $n$ realizations of $(X_i, Y_i)$. We are abusing notation a bit by using the same letters to refer to random variables and their realizations.

$u_i$ is an unobserved random variable which represents the fact that the relationship between $Y$ and $X$ is not exactly linear. For the moment we will assume that $u_i$ has expected value zero. Note that if $u_i = 0$, the relationship between $Y_i$ and $X_i$ would be exactly linear, so it is the presence of $u_i$ that breaks the exact nature of the relationship. $Y$ is usually referred to as the explained or dependent variable, and $X$ as the explanatory or independent variable.

We will refer to $u_i$ as the 'error term', a terminology more appropriate in the experimental sciences, where a cause $x$ (say, the dose of a drug) is administered to different subjects and then an effect $y$ is measured (say, body temperature). In this case $u_i$ might be a measurement error due to the erratic behavior of a measurement instrument (for example, a thermometer). In a social science like economics, $u_i$ represents a broader notion of 'ignorance': whatever is not observed (by ignorance, omission, etc.) that affects $y$ besides $x$.

[ FIGURE 1: SCATTER DIAGRAM ]

The first goal will be to find reasonable estimates for $\alpha$ and $\beta$ based solely on the data, that is, $(X_i, Y_i)$, $i = 1, \ldots, n$.
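To make model (1.1) concrete, the following minimal sketch generates artificial data from it. The parameter values, the function name `simulate_linear_model`, and the choice of normally distributed errors are assumptions made only for illustration; the text so far requires only that $u_i$ has expected value zero.

```python
import random


def simulate_linear_model(alpha, beta, xs, sigma, seed=0):
    """Return Y_i = alpha + beta * X_i + u_i for each X_i in xs.

    Assumption (not required by the text at this point): u_i is drawn
    from N(0, sigma^2), which has expected value zero.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical illustration values; none of these numbers come from the text.
xs = [float(i) for i in range(1, 11)]
ys_exact = simulate_linear_model(2.0, 0.5, xs, sigma=0.0)  # u_i = 0: exact line
ys_noisy = simulate_linear_model(2.0, 0.5, xs, sigma=1.0)  # scattered around it
```

With `sigma=0.0` every point lies exactly on the line $\alpha + \beta X$; with a positive `sigma` the points scatter around it, which is the situation a scatter diagram would show.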

### 1.2 The Least Squares Method

Let us denote by $\hat\alpha$ and $\hat\beta$ the estimates of $\alpha$ and $\beta$ in the simple linear model. Let us also define the following quantities. The first one is an estimate of $Y_i$:

$$\hat Y_i \equiv \hat\alpha + \hat\beta X_i$$

Intuitively, we have replaced $\alpha$ and $\beta$ by their estimates and treated the relationship as if it were exactly linear, i.e., as if $u_i$ were zero. This will be understood as an estimate of $Y_i$. It is then natural to define a notion of estimation error as follows:

$$e_i \equiv Y_i - \hat Y_i$$

which measures the difference between $Y_i$ and its estimate.

A natural goal is to find $\hat\alpha$ and $\hat\beta$ so that the $e_i$'s are 'small' in some sense. It is interesting to see how the problem works from a graphical perspective. The data correspond to $n$ points scattered in the $(X, Y)$ plane. The presence of a linear relationship like (1.1) is consistent with points scattered around an imaginary straight line. Note that if the $u_i$ were indeed zero, all points would lie along the same line, consistent with an exact linear relationship. As mentioned above, it is the presence of $u_i$ that breaks this exact relationship.

Now note that for any given values of $\hat\alpha$ and $\hat\beta$, the points determined by the fitted model:

$$\hat Y \equiv \hat\alpha + \hat\beta X$$

correspond to a line in the $(X, Y)$ plane. Hence different values of $\hat\alpha$ and $\hat\beta$ correspond to different estimated lines, so choosing particular values is equivalent to choosing a specific line on the plane. For the $i$-th observation, the estimation error $e_i$ can be seen graphically as the vertical distance between the points $(X_i, Y_i)$ and $(X_i, \hat Y_i)$, that is, between $(X_i, Y_i)$ and the fitted line. So, intuitively, we want values of $\hat\alpha$ and $\hat\beta$ such that the fitted line they induce passes as close as possible to all the points in the scatter, so that the errors are as small as possible.
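The idea that each pair $(\hat\alpha, \hat\beta)$ is a candidate line with its own set of estimation errors can be sketched in a few lines. The data points and the two candidate lines below are made up for illustration, and the function names are hypothetical.

```python
def fitted_values(xs, a_hat, b_hat):
    """Fitted value Y_hat_i = a_hat + b_hat * X_i for each observation."""
    return [a_hat + b_hat * x for x in xs]


def residuals(xs, ys, a_hat, b_hat):
    """Estimation errors e_i = Y_i - Y_hat_i (signed vertical distances)."""
    return [y - y_hat for y, y_hat in zip(ys, fitted_values(xs, a_hat, b_hat))]


# Made-up data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

# Two candidate lines: each choice of (a_hat, b_hat) gives different errors.
e_line_a = residuals(xs, ys, a_hat=0.0, b_hat=2.0)
e_line_b = residuals(xs, ys, a_hat=1.0, b_hat=1.0)
```

The first candidate misses only one point, while the second misses three of the four, so it is "further" from the scatter in an informal sense that still needs to be made precise.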

[ FIGURE 2: SCATTER DIAGRAM WITH 'CANDIDATE' LINE ]

Note that if we had only two observations (with different values of $X$), the problem would have a very simple solution: it reduces to finding the only two values of $\hat\alpha$ and $\hat\beta$ that make the estimation errors exactly equal to zero. Graphically, this is possible because it is equivalent to finding the only straight line that passes through the two available observations. Trivially, in this extreme case all estimation errors are zero.

The more realistic case appears when we have more than two observations, not all of them lying on a single line. Obviously, a single line cannot pass through three or more non-aligned points, so we cannot make all errors equal to zero. The problem now is to find values of $\hat\alpha$ and $\hat\beta$ that determine a line passing as close as possible to all the points, so that estimation errors are, in the aggregate, small. For this we need to introduce a criterion
of what we mean by the line being close to or far from the points. Let us define a penalty function, which consists of adding up all the squared estimation errors, so that positive and negative errors matter alike. For any $\hat\alpha$ and $\hat\beta$, this gives us an idea of how large the aggregate estimation error is:

$$SSR(\hat\alpha, \hat\beta) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat\alpha - \hat\beta X_i)^2$$

SSR stands for sum of squared residuals. Note that, given the observations $Y_i$ and $X_i$, this is a function of $\hat\alpha$ and $\hat\beta$; that is, different values of $\hat\alpha$ and $\hat\beta$ correspond to different lines through the scatter of data points, implying different estimation errors. It is now natural to look for $\hat\alpha$ and $\hat\beta$ that make this aggregate error as small as possible.

The values of $\hat\alpha$ and $\hat\beta$ that minimize the sum of squared residuals are:

$$\hat\beta = \frac{\sum X_i Y_i - n \bar Y \bar X}{\sum X_i^2 - n \bar X^2}$$

and

$$\hat\alpha = \bar Y - \hat\beta \bar X$$

which are known as the least squares estimators of $\beta$ and $\alpha$.

#### Derivation of the Least Squares Estimators

The next paragraphs show how to obtain these estimators. Fortunately, it is easy to show that $SSR(\hat\alpha, \hat\beta)$ is globally convex and differentiable, so the first-order conditions for a minimum are:

$$\frac{\partial SSR(\hat\alpha, \hat\beta)}{\partial \hat\alpha} = 0, \qquad \frac{\partial SSR(\hat\alpha, \hat\beta)}{\partial \hat\beta} = 0$$

The first of these conditions is:

$$\frac{\partial \sum e_i^2}{\partial \hat\alpha} = -2 \sum (Y_i - \hat\alpha - \hat\beta X_i) = 0 \tag{1.2}$$

Dividing by $-2$ and distributing the summations:

$$\sum Y_i = n \hat\alpha + \hat\beta \sum X_i \tag{1.3}$$

This last expression is very important, and we will return to it frequently. From the second first-order condition:
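
$$\frac{\partial \sum e_i^2}{\partial \hat\beta} = -2 \sum X_i (Y_i - \hat\alpha - \hat\beta X_i) = 0 \tag{1.4}$$

Dividing by $-2$ and distributing the summations:

$$\sum X_i Y_i = \hat\alpha \sum X_i + \hat\beta \sum X_i^2 \tag{1.5}$$

(1.3) and (1.5) form a system of two linear equations in two unknowns ($\hat\alpha$ and $\hat\beta$), known as the normal equations. Dividing (1.3) by $n$ and solving for $\hat\alpha$ we get:

$$\hat\alpha = \bar Y - \hat\beta \bar X \tag{1.6}$$

Replacing in (1.5):

$$\begin{aligned}
\sum X_i Y_i &= (\bar Y - \hat\beta \bar X) \sum X_i + \hat\beta \sum X_i^2 \\
\sum X_i Y_i &= \bar Y \sum X_i - \hat\beta \bar X \sum X_i + \hat\beta \sum X_i^2 \\
\sum X_i Y_i - \bar Y \sum X_i &= \hat\beta \left( \sum X_i^2 - \bar X \sum X_i \right) \\
\hat\beta &= \frac{\sum X_i Y_i - \bar Y \sum X_i}{\sum X_i^2 - \bar X \sum X_i}
\end{aligned}$$

Note that for any variable $Z$, $\bar Z = \sum Z_i / n$, so $\sum Z_i = n \bar Z$. Replacing, we get:

$$\hat\beta = \frac{\sum X_i Y_i - n \bar Y \bar X}{\sum X_i^2 - n \bar X^2} \tag{1.7}$$

It will be useful to adopt the following notation: $x_i = X_i - \bar X$ and $y_i = Y_i - \bar Y$, so lowercase letters denote the observations as deviations from their sample means. Using this notation:

$$\begin{aligned}
\sum x_i y_i &= \sum (X_i - \bar X)(Y_i - \bar Y) \\
&= \sum (X_i Y_i - X_i \bar Y - \bar X Y_i + \bar X \bar Y) \\
&= \sum X_i Y_i - \bar Y \sum X_i - \bar X \sum Y_i + n \bar X \bar Y \\
&= \sum X_i Y_i - n \bar Y \bar X - n \bar X \bar Y + n \bar X \bar Y \\
&= \sum X_i Y_i - n \bar Y \bar X
\end{aligned}$$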
which corresponds to the numerator of (1.7). Performing a similar operation on the denominator of (1.7), we get the following alternative expression for the least squares estimate of $\beta$:

$$\hat\beta = \frac{\sum x_i y_i}{\sum x_i^2}$$

[ FIGURE 3: SCATTER DIAGRAM AND OLS LINE ]

### 1.3 Algebraic Properties of Least Squares Estimators

By algebraic properties of the estimator we mean those that are a direct consequence of the minimization process, as opposed to the statistical properties, which will be studied in Section 1.5.

• Property 1: $\sum e_i = 0$. From the first normal equation (1.2), dividing by $-2$ and replacing by the definition of $e_i$, we easily verify that, as a consequence of minimizing the sum of squared residuals, the sum of the residuals, and consequently their average, is equal to zero.

• Property 2: $\sum X_i e_i = 0$. This can be checked by dividing by $-2$ in the second normal equation (1.4). The sample covariance between $X$ and $e$ is given by:

$$\begin{aligned}
Cov(X, e) &= \frac{1}{n-1} \sum (X_i - \bar X)(e_i - \bar e) \\
&= \frac{1}{n-1} \left[ \sum X_i e_i - \bar e \sum X_i - \bar X \sum e_i + n \bar X \bar e \right] \\
&= \frac{1}{n-1} \sum X_i e_i
\end{aligned}$$

since from the previous property $\sum e_i$, and hence $\bar e$, are equal to zero. This property therefore says that, as a consequence of using the method of least squares, the sample covariance between the explanatory variable $X$ and the residual $e$ is zero or, which is the same, the residuals are linearly unrelated to the explanatory variable.

• Property 3: The estimated regression line corresponds to the function $\hat Y(X) = \hat\alpha + \hat\beta X$, where we take $\hat\alpha$ and $\hat\beta$ as parameters, so that $\hat Y$ is a function of $X$. Consider what happens when we evaluate this function at $\bar X$, the mean of $X$:

$$\hat Y(\bar X) = \hat\alpha + \hat\beta \bar X$$

But from (1.6):

$$\hat\alpha + \hat\beta \bar X = \bar Y$$

Then $\hat Y(\bar X) = \bar Y$; that is, the regression line estimated by the method of least squares passes through the point of means.
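
• Property 4: Relationship between regression and correlation. Remember that the sample correlation coefficient between $X$ and $Y$ for a sample of $n$ observations $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, is defined as:

$$r_{XY} = \frac{Cov(X, Y)}{S_X S_Y}$$

The following result establishes the relationship between $r_{XY}$ and $\hat\beta$:

$$\begin{aligned}
\hat\beta &= \frac{\sum x_i y_i}{\sum x_i^2} \\
&= \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \sqrt{\sum x_i^2}} \\
&= \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \sqrt{\sum y_i^2}} \cdot \frac{\sqrt{\sum y_i^2}}{\sqrt{\sum x_i^2}} \\
&= \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \sqrt{\sum y_i^2}} \cdot \frac{\sqrt{\sum y_i^2} / \sqrt{n}}{\sqrt{\sum x_i^2} / \sqrt{n}}
\end{aligned}$$

$$\hat\beta = r \, \frac{S_Y}{S_X}$$

(the normalizing factor used in the sample standard deviations cancels in the ratio $S_Y / S_X$). If $r = 0$ then $\hat\beta = 0$. Note that if both variables have the same sample variance, then the correlation coefficient is equal to the regression coefficient $\hat\beta$. We can also see that, unlike the correlation coefficient, $\hat\beta$ is not invariant to changes in scale or units of measurement.

• Property 5: The sample means of $Y_i$ and $\hat Y_i$ are the same. By definition, $Y_i = \hat Y_i + e_i$ for $i = 1, \ldots, n$. Then, summing over $i$:

$$\sum Y_i = \sum \hat Y_i + \sum e_i$$

and dividing by $n$:

$$\frac{\sum Y_i}{n} = \frac{\sum \hat Y_i}{n}$$

since $\sum e_i = 0$ from the first-order conditions. Then:

$$\bar Y = \bar{\hat Y}$$

which is the desired result.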

• Property 6: $\hat\beta$ is a linear function of the $Y_i$'s. That is, $\hat\beta$ can be written as $\hat\beta = \sum w_i Y_i$, where the $w_i$'s are real numbers, not all of them equal to zero. This is easy to prove. Let us start by writing $\hat\beta$ as follows:

$$\hat\beta = \sum \frac{x_i}{\sum x_j^2} \, y_i$$

and call $w_i = x_i / \sum x_j^2$. Note that:

$$\sum x_i = \sum (X_i - \bar X) = \sum X_i - n \bar X = 0$$

which implies $\sum w_i = 0$. From the previous result:

$$\hat\beta = \sum w_i y_i = \sum w_i (Y_i - \bar Y) = \sum w_i Y_i - \bar Y \sum w_i = \sum w_i Y_i$$

which gives the desired result. This does not have much intuitive meaning so far, but it will be useful for later results.

### 1.4 The Two-Variable Linear Model under the Classical Assumptions

$$Y_i = \alpha + \beta X_i + u_i, \qquad i = 1, \ldots, n$$

In addition to the linear relationship between $Y$ and $X$, we will assume:

1. $E(u_i) = 0$, $i = 1, 2, \ldots, n$. 'On average' the relationship between $Y$ and $X$ is linear.
2. $Var(u_i) = E[(u_i - E(u_i))^2] = E(u_i^2) = \sigma^2$, $i = 1, 2, \ldots, n$. The variance of the error term is constant for all observations. We will say that the error term is homoskedastic.
3. $Cov(u_i, u_j) = 0$ for all $i \neq j$. The error term for an observation $i$ is not linearly related to the error term of any other observation $j$. If the variables are measured over time, e.g., $i = 1980, 1981, \ldots, 1997$, we will say that there is no autocorrelation. In general, we will say that there is no serial correlation. Note that since $E(u_i) = 0$, assuming $Cov(u_i, u_j) = 0$ is equivalent to assuming $E(u_i u_j) = 0$.
4. The values of $X_i$ are non-stochastic and not all of them equal.
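The algebraic properties of Section 1.3 follow from the minimization alone, so they can be checked on any data set, whether or not the assumptions listed above hold. The sketch below does this on a small made-up sample; the data values and the helper name `ols_fit` are illustrative assumptions.

```python
def ols_fit(xs, ys):
    """Least squares estimates using the deviations-from-means formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar  # equation (1.6)
    return alpha_hat, beta_hat


# Made-up data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = ols_fit(xs, ys)
fitted = [alpha_hat + beta_hat * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

n = len(xs)
x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)

sum_e = sum(resid)                                # Property 1: should be ~0
sum_xe = sum(x * e for x, e in zip(xs, resid))    # Property 2: should be ~0
mean_fitted = sum(fitted) / n                     # Property 5: equals mean of Y
weights = [(x - x_bar) / s_xx for x in xs]        # w_i from Property 6
beta_from_weights = sum(w * y for w, y in zip(weights, ys))
```

Because of floating-point rounding, the sums come out as very small numbers rather than exact zeros, so any check should use a tolerance.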

The classical assumptions provide a basic probabilistic structure to study the linear model. Most assumptions are of a pedagogic nature, and we will study later on how they can be relaxed. Nevertheless, they provide a simple framework to explore the nature of the least squares estimator.

### 1.5 Statistical Properties of Least Squares Estimators

The problem is actually to find good estimates of $\alpha$, $\beta$ and $\sigma^2$. The previous sections present estimates of the first two based on the principle of least squares so, trivially, these estimates are 'good' in the sense that they optimize a certain notion of fit: they make the sum of squared residuals as small as possible. It is relevant to remark that in obtaining the least squares estimators we have made no use of the classical assumptions described above. Hence, the natural step is to explore whether we can deduce additional properties satisfied by the least squares estimator, so we can say that it is good in a sense that goes beyond that implicit in the least squares criterion. The following are called statistical properties since they arise as a consequence of the statistical structure of the model.

We will repeatedly use the following expressions for the LS estimators:

$$\hat\beta = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad \hat\alpha = \bar Y - \hat\beta \bar X$$

We will first explore the main properties of $\hat\beta$ in detail, and leave the analysis of $\hat\alpha$ as an exercise. The conceptual starting point is to see that $\hat\beta$ depends explicitly on the $Y_i$'s which, in turn, depend on the $u_i$'s, which are, by construction, random variables. Then $\hat\beta$ is a random variable, and it makes sense to talk about its moments (mean and variance, for example) and its distribution.

Averaging (1.1) over $i$ gives $\bar Y = \alpha + \beta \bar X + \bar u$, and subtracting this from (1.1) it is easy to verify that:

$$y_i = x_i \beta + u_i^*$$

where $u_i^* = u_i - \bar u$ and, according to the classical assumptions, $E(u_i^*) = 0$ and, consequently, $E(y_i) = x_i \beta$. This is known as the classical two-variable linear model in deviations from the means.

• $\hat\beta$ is an unbiased estimator, that is, $E(\hat\beta) = \beta$. To prove the result, use the linearity property (Property 6 of Section 1.3):

$$\begin{aligned}
\hat\beta &= \sum w_i y_i \\
E(\hat\beta) &= \sum w_i E(y_i) \qquad \text{(the } w_i \text{'s are non-stochastic)} \\
&= \sum w_i x_i \beta \\
&= \beta \sum w_i x_i \\
&= \beta \, \frac{\sum x_i^2}{\sum x_i^2} \\
&= \beta
\end{aligned}$$

• The variance of $\hat\beta$ is $\sigma^2 / \sum x_i^2$. From the linearity property, $\hat\beta = \sum w_i Y_i$, then

$$V(\hat\beta) = V\left( \sum w_i Y_i \right)$$

Now note two things. First:

$$V(Y_i) = V(\alpha + \beta X_i + u_i) = V(u_i) = \sigma^2$$

since $X_i$ is non-stochastic. Second, note that $E(Y_i) = \alpha + \beta X_i$, so for $i \neq j$

$$Cov(Y_i, Y_j) = E[(Y_i - E(Y_i))(Y_j - E(Y_j))] = E(u_i u_j) = 0$$

by the no serial correlation assumption. Then $V(\sum w_i Y_i)$ is the variance of a (weighted) sum of uncorrelated terms. Hence

$$\begin{aligned}
V(\hat\beta) &= V\left( \sum w_i Y_i \right) \\
&= \sum w_i^2 V(Y_i) \\
&= \sigma^2 \sum w_i^2 \\
&= \sigma^2 \, \frac{\sum x_i^2}{\left( \sum x_i^2 \right)^2} \\
&= \frac{\sigma^2}{\sum x_i^2}
\end{aligned}$$

• Gauss-Markov Theorem: under the classical assumptions, $\hat\beta$, the LS estimator of $\beta$, has the smallest variance among the class of linear and unbiased estimators. More formally, if $\beta^*$ is any linear and unbiased estimator of $\beta$, then:

$$V(\beta^*) \geq V(\hat\beta)$$

The proof of a more general version of this result will be postponed until Chapter 3.

Discussion: BLUE (best linear unbiased estimator); 'best' does not mean 'good'; we would want minimum variance among all unbiased estimators (without 'linear'); 'linear' is not an interesting class; etc. If we drop any of the assumptions, the OLS estimator need no longer be BLUE. This justifies the use of OLS when all the assumptions are correct.
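Unbiasedness and the variance formula $\sigma^2 / \sum x_i^2$ are statements about repeated samples, so a small Monte Carlo experiment can illustrate them. The sketch below holds the $X_i$ fixed across replications (non-stochastic regressors) and redraws only the errors. The parameter values, the number of replications, and the use of independent normal errors are assumptions chosen for illustration; the two results only need the zero mean, constant variance, and no serial correlation assumptions.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least squares slope: sum(x_i * y_i) / sum(x_i^2) in deviations form."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def simulate_slopes(alpha, beta, xs, sigma, reps, seed=0):
    """Re-estimate the slope on `reps` samples that share the same fixed xs."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        ys = [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(ols_slope(xs, ys))
    return slopes


# Hypothetical design: alpha = 2, beta = 0.5, sigma = 1, X fixed at 1..10.
xs = [float(i) for i in range(1, 11)]
slopes = simulate_slopes(2.0, 0.5, xs, sigma=1.0, reps=5000)

x_bar = sum(xs) / len(xs)
theoretical_var = 1.0 ** 2 / sum((x - x_bar) ** 2 for x in xs)  # sigma^2 / sum x_i^2
mc_mean = statistics.mean(slopes)      # should be close to beta = 0.5
mc_var = statistics.pvariance(slopes)  # should be close to theoretical_var
```

The simulated mean and variance only approximate their theoretical counterparts, and the approximation improves as the number of replications grows.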

#### Estimation of $\sigma^2$

So far we have concentrated the analysis on $\alpha$ and $\beta$. As an estimate for $\sigma^2$ we will propose:

$$S^2 = \frac{\sum e_i^2}{n - 2}$$

We will later show that $S^2$ provides an unbiased estimator for $\sigma^2$.

### 1.6 Goodness of fit

After estimating the parameters of the regression line, it is interesting to check how well the estimated model fits the data. We want a measure of how well the fitted line represents the observations of the variables of the model.

To look for such a measure of goodness of fit, we start from the definition of the residual, $e_i = Y_i - \hat Y_i$, solve for $Y_i$ and subtract the sample mean of $Y_i$ from both sides to obtain:

$$\begin{aligned}
Y_i - \bar Y &= \hat Y_i - \bar Y + e_i \\
y_i &= \hat y_i + e_i
\end{aligned}$$

using the notation defined before and noting that, from Property 5, $\bar Y = \bar{\hat Y}$. Taking the square of both sides and summing over all the observations:

$$\begin{aligned}
y_i^2 &= (\hat y_i + e_i)^2 \\
&= \hat y_i^2 + e_i^2 + 2 \hat y_i e_i \\
\sum y_i^2 &= \sum \hat y_i^2 + \sum e_i^2 + 2 \sum \hat y_i e_i
\end{aligned}$$

The next step is to show that $\sum \hat y_i e_i = 0$:

$$\begin{aligned}
\sum \hat y_i e_i &= \sum (\hat\alpha + \hat\beta X_i - \bar Y) e_i \\
&= (\hat\alpha - \bar Y) \sum e_i + \hat\beta \sum X_i e_i \\
&= 0 + 0
\end{aligned}$$

from the first-order conditions. Then we get the following important decomposition:

$$\begin{aligned}
\sum y_i^2 &= \sum \hat y_i^2 + \sum e_i^2 \\
TSS &= ESS + RSS
\end{aligned}$$

This is a key result: it indicates that when we use the least squares method, the total variability of the dependent variable around its sample mean (TSS) can be decomposed as the sum of two terms. The first one corresponds to the variability of $\hat Y$ (ESS) and represents the variability explained by the fitted model. The second term represents the variability not explained by the model (RSS), associated with the error term.
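The decomposition and the estimate $S^2$ are straightforward to compute once the line has been fitted. The following sketch uses a small made-up sample; the data and the function name `decompose` are illustrative assumptions.

```python
def decompose(xs, ys):
    """Return (TSS, ESS, RSS, S2) for the least squares line with intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    fitted = [alpha_hat + beta_hat * x for x in xs]

    tss = sum((y - y_bar) ** 2 for y in ys)             # total variability
    ess = sum((f - y_bar) ** 2 for f in fitted)         # explained by the line
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted)) # left in the residuals
    s2 = rss / (n - 2)                                  # estimate of sigma^2
    return tss, ess, rss, s2


# Made-up data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
tss, ess, rss, s2 = decompose(xs, ys)
```

Here the three sums are computed independently of one another, so the fact that `tss` matches `ess + rss` up to rounding is a genuine check of the decomposition rather than something built into the code.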

For a given model, the best situation arises when the residuals are all zero, in which case the total variability (TSS) coincides with the explained variability (ESS). The worst case corresponds to the situation in which the fitted model does not explain any of the total variability, in which case TSS coincides with RSS. From this observation, it is natural to suggest the following goodness of fit measure, known as $R^2$, or coefficient of determination:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

It can be shown (we will do it in the exercises) that $R^2 = r^2$. Consequently, $0 \leq R^2 \leq 1$. When $R^2 = 1$, $|r| = 1$, which corresponds to the case in which the relationship between $Y$ and $X$ is exactly linear. On the other hand, $R^2 = 0$ is equivalent to $r = 0$, which corresponds to the case in which $Y$ and $X$ are linearly unrelated. It is interesting to note that TSS does not depend on the estimated model, that is, it depends on neither $\hat\beta$ nor $\hat\alpha$. Then, if $\hat\beta$ and $\hat\alpha$ are chosen so as to minimize SSR (the RSS above), they automatically maximize $R^2$. This implies that, for a given model, the least squares estimate maximizes $R^2$.

The $R^2$ is, arguably, the most used and abused measure of quality of a regression model. A detailed analysis of the extent to which a high $R^2$ can be taken as representative of a 'good' model will be undertaken in Chapter 4.

### 1.7 Inference in the two-variable linear model

The methods discussed so far provide reasonably good point estimates of the parameters of interest $\alpha$, $\beta$ and $\sigma^2$, but usually we will be interested in evaluating hypotheses involving the parameters, or constructing confidence intervals for them. For example, consider the case of a simple consumption function where consumption is specified as a simple linear function of income. We could be interested in evaluating whether the marginal propensity to consume is equal to, say, 0.75, or whether autonomous consumption is equal to zero.

In general terms, a hypothesis about a parameter of the model is a conjecture about it, which can be either false or true. The central problem is that, in order to check whether such a statement is true or false, we do not have the chance to observe the parameter. Instead, based on the available data, we have an estimate of it. As an example, suppose we are interested in evaluating the rather strong null hypothesis that income is not an explanatory factor of consumption, against the hypothesis that it is a relevant factor. In our simple setup this corresponds to $H_0: \beta = 0$ against $H_A: \beta \neq 0$. The logic we will use is the following: if the null hypothesis were in fact true, $\beta$ would be exactly zero. Realizations of $\hat\beta$ can potentially take any value, since $\hat\beta$ is, by construction, a random variable. But if $\hat\beta$ is a 'good' estimator of $\beta$, when the null hypothesis is true it should take values close to zero. On the other hand, if the null hypothesis were false, the realizations of $\hat\beta$ should be significantly different from zero. Then, the procedure consists of computing $\hat\beta$ from the data and rejecting the null if the obtained value is significantly different from zero, or accepting it otherwise.

Of course, the central concept behind this procedure lies in specifying what we mean by 'very close' or 'very far', given that $\hat\beta$ is a random variable. More specifically, we need to know the distribution of $\hat\beta$ under the null hypothesis so we can define precisely the notion of 'significantly different from zero'. In this context such a statement is necessarily probabilistic; that is, we will take as the rejection region a set of values that lie 'far away' from zero, or a set of values that under the null hypothesis appear with very low probability.

The properties discussed in the previous section are informative about certain moments of $\hat\beta$ or $\hat\alpha$ (for example, their means and variances), but they are not enough for the purpose of knowing their distributions. Consequently, we need to introduce an additional assumption. We will assume that $u_i$ is normally distributed, for $i = 1, \ldots, n$. Given that we have already assumed that $u_i$ has zero mean and constant variance equal to $\sigma^2$, we have:

$$u_i \sim N(0, \sigma^2)$$

Given that $Y_i = \alpha + \beta X_i + u_i$ and that the $X_i$'s are non-stochastic, we immediately see that the $Y_i$'s are also normally distributed, since linear transformations of normal random variables are also normal. In particular, given that the normal distribution can be characterized by its mean and variance only, we get:

$$Y_i \sim N(\alpha + \beta X_i, \sigma^2)$$

for every $i = 1, \ldots, n$. In a similar fashion, $\hat\beta$ is also normally distributed, since by Property 6 it is a linear combination of the $Y_i$'s, that is:

$$\hat\beta \sim N\left( \beta, \ \sigma^2 / \sum x_i^2 \right)$$

If $\sigma^2$ were known we could use this result to test simple hypotheses like:

$$H_0: \beta = \beta_0 \quad \text{vs.} \quad H_A: \beta \neq \beta_0$$

Subtracting from $\hat\beta$ its expected value under the null and dividing by its standard deviation, we get:

$$z = \frac{\hat\beta - \beta_0}{\sigma / \sqrt{\sum x_i^2}} \sim N(0, 1)$$

Hence, if the null hypothesis is true, $z$ should take values that are small in absolute value, and large otherwise.
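As a rough sketch of how the $z$ statistic is used when $\sigma$ is treated as known, the code below computes $z$ and compares it with the standard normal percentile $z_c$ that satisfies $Pr(-z_c \leq z \leq z_c) = 1 - c$. All numerical inputs are hypothetical, the function names are made up, and $c = 0.05$ is just a conventional choice.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(beta_hat, beta_0, sigma, s_xx):
    """z = (beta_hat - beta_0) / (sigma / sqrt(sum x_i^2)), sigma assumed known."""
    return (beta_hat - beta_0) / (sigma / sqrt(s_xx))


def z_test(beta_hat, beta_0, sigma, s_xx, c=0.05):
    """Return (z, z_c, reject) for H0: beta = beta_0 against a two-sided alternative."""
    z = z_statistic(beta_hat, beta_0, sigma, s_xx)
    z_c = NormalDist().inv_cdf(1.0 - c / 2.0)  # two-sided cutoff, about 1.96 for c = 0.05
    return z, z_c, abs(z) > z_c


# Hypothetical inputs: slope estimate 1.99, sum of squared deviations 10, known sigma 0.2.
z_near, z_c, reject_near = z_test(beta_hat=1.99, beta_0=2.0, sigma=0.2, s_xx=10.0)
z_far, _, reject_far = z_test(beta_hat=1.99, beta_0=0.0, sigma=0.2, s_xx=10.0)
```

In this made-up example the estimate is very close to the hypothesized value 2.0, so that null is not rejected, while the null $\beta = 0$ is rejected.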

As you should remember from a basic statistics course, deciding whether $z$ is 'large' is accomplished by defining a rejection region and an acceptance region as follows. The acceptance region includes values that lie close to the one corresponding to the null hypothesis. Let $0 < c < 1$ and let $z_c$ be a number such that:

$$Pr(-z_c \leq z \leq z_c) = 1 - c$$

Replacing $z$ by its definition:

$$Pr\left( \beta_0 - z_c \, \sigma / \sqrt{\sum x_i^2} \ \leq \ \hat\beta \ \leq \ \beta_0 + z_c \, \sigma / \sqrt{\sum x_i^2} \right) = 1 - c$$

Then the acceptance region is given by the interval:

$$\beta_0 \pm z_c \left( \sigma / \sqrt{\sum x_i^2} \right)$$

so we accept the null hypothesis if the observed realization of $\hat\beta$ lies within this interval and reject it otherwise. The number $c$ is specified in advance and is usually a small number. It is called the significance level of the test. Note that it gives the probability that we reject the null hypothesis when it is correct. Under the normality assumption, the value $z_c$ can be easily obtained from a table of percentiles of the standard normal distribution.

As you should also remember from a basic statistics class, a similar logic can be applied to construct a confidence interval for $\beta_0$. Note that:

$$Pr\left( \hat\beta - z_c \left( \sigma / \sqrt{\sum x_i^2} \right) \ \leq \ \beta_0 \ \leq \ \hat\beta + z_c \left( \sigma / \sqrt{\sum x_i^2} \right) \right) = 1 - c$$

Then a $1 - c$ confidence interval for $\beta_0$ will be given by:

$$\hat\beta \pm z_c \, \sigma / \sqrt{\sum x_i^2}$$

The practical problem with the previous procedures is that they require that we know $\sigma^2$, which is usually not available. Instead, we can compute its estimated version $S^2$. Define $t$ as:

$$t = \frac{\hat\beta - \beta}{S / \sqrt{\sum x_i^2}}$$

$t$ is simply $z$ where we have replaced $\sigma^2$ by $S^2$. A very important result is that by making this replacement we have:

$$t \sim t_{n-2}$$

that is, the 't-statistic' has the so-called 't-distribution with $n - 2$ degrees of freedom'. Hence, when we use the estimated version of the variance, we obtain a different distribution for the statistic used to test simple hypotheses and construct confidence intervals.

Consequently, applying once again the same logic, in order to test the null hypothesis $H_0: \beta = \beta_0$ against $H_A: \beta \neq \beta_0$ we use the t-statistic:

$$t = \frac{\hat\beta - \beta_0}{S / \sqrt{\sum x_i^2}} \sim t_{n-2}$$
and a $1 - c$ confidence interval for $\beta_0$ will be given by:

$$\hat\beta \pm t_c \left( S / \sqrt{\sum x_i^2} \right)$$

where now $t_c$ is a percentile of the 't' distribution with $n - 2$ degrees of freedom, which is usually tabulated in basic statistics and econometrics textbooks.

An important particular case is the insignificance hypothesis, that is, $H_0: \beta = 0$ against $H_A: \beta \neq 0$. Under the null, $X$ does not help explain $Y$, and under the alternative, $X$ is linearly related to $Y$. Replacing $\beta_0$ by 0 above we get:

$$t_I = \frac{\hat\beta}{S / \sqrt{\sum x_i^2}} \sim t_{n-2}$$

which is usually reported as a standard outcome in most regression packages.

Another way to check for the significance of the linear relationship is to look at how large the explained sum of squares ESS is. Recall that if the model has an intercept we have that:

$$TSS = ESS + RSS$$

If there is no linear relationship between $Y$ and $X$, ESS should be very close to zero. Consider the following statistic, which is just a 'standardized' version of the ESS:

$$F = \frac{ESS}{RSS / (n - 2)}$$

It can be shown that, under the normality assumption, $F$ has the F distribution with 1 degree of freedom in the numerator and $n - 2$ degrees of freedom in the denominator, which is usually labeled $F(1, n - 2)$. Note that if $X$ does not help explain $Y$ in a linear sense, ESS should be very small, which would make $F$ very small. Then, we should reject the null hypothesis that $X$ does not help explain $Y$ if the $F$ statistic computed from the data takes a large value, and accept it otherwise.

Note that by definition $R^2 = ESS / TSS = 1 - RSS / TSS$. Divide both the numerator and the denominator of the $F$ statistic by TSS. Solving for ESS and RSS and replacing above, we can write the $F$ statistic in terms of the $R^2$ coefficient as:

$$F = \frac{R^2}{(1 - R^2) / (n - 2)}$$

Then, the $F$ test is actually looking at whether the $R^2$ is significantly high. As expected, there is a close relationship between the $F$ statistic and the 't' statistic for the insignificance hypothesis ($t_I$).
In fact, when there is no linear relationship between $Y$ and $X$ in the sample, ESS is zero or, equivalently, $\hat\beta = 0$. Moreover, it can be easily shown that:

$$F = t_I^2$$

We will leave the proof as an exercise.
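Although the proof is left as an exercise, the identities $F = t_I^2$, $F = R^2 / [(1 - R^2)/(n - 2)]$ and $R^2 = r^2$ can at least be checked numerically. The sketch below does so on a small made-up sample; the data and the function name `regression_summary` are illustrative assumptions, and a numerical check on one sample is of course not a proof.

```python
from math import isclose, sqrt


def regression_summary(xs, ys):
    """Return the slope, t_I, F, R^2 and r for the two-variable model."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)  # this is TSS
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    fitted = [alpha_hat + beta_hat * x for x in xs]
    ess = sum((f - y_bar) ** 2 for f in fitted)
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))

    s2 = rss / (n - 2)                  # estimate of sigma^2
    t_i = beta_hat / sqrt(s2 / s_xx)    # t statistic for H0: beta = 0
    f_stat = ess / (rss / (n - 2))      # F statistic
    r2 = ess / s_yy                     # coefficient of determination
    r = s_xy / sqrt(s_xx * s_yy)        # sample correlation coefficient
    return {"n": n, "beta_hat": beta_hat, "t_I": t_i, "F": f_stat, "R2": r2, "r": r}


# Made-up data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
out = regression_summary(xs, ys)

f_equals_t_squared = isclose(out["F"], out["t_I"] ** 2, rel_tol=1e-9)
f_from_r2 = out["R2"] / ((1.0 - out["R2"]) / (out["n"] - 2))
f_matches_r2_form = isclose(out["F"], f_from_r2, rel_tol=1e-9)
r2_equals_r_squared = isclose(out["R2"], out["r"] ** 2, rel_tol=1e-9)
```

The comparisons use a relative tolerance because the quantities are computed along different floating-point paths and agree only up to rounding.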