Greene econometric analysis

Published in: Economy & Finance, Technology

  • 1. Greene-50240 gree50240˙FM July 10, 2002 12:51 FIFTH EDITION. ECONOMETRIC ANALYSIS. William H. Greene, New York University. Upper Saddle River, New Jersey 07458
  • 2. CIP data to come. Executive Editor: Rod Banister; Editor-in-Chief: P. J. Boardman; Managing Editor: Gladys Soto; Assistant Editor: Marie McHale; Editorial Assistant: Lisa Amato; Senior Media Project Manager: Victoria Anderson; Executive Marketing Manager: Kathleen McLellan; Marketing Assistant: Christopher Bath; Managing Editor (Production): Cynthia Regan; Production Editor: Michael Reynolds; Production Assistant: Dianne Falcone; Permissions Supervisor: Suzanne Grappi; Associate Director, Manufacturing: Vinnie Scelta; Cover Designer: Kiwi Design; Cover Photo: Anthony Bannister/Corbis; Composition: Interactive Composition Corporation; Printer/Binder: Courier/Westford; Cover Printer: Coral Graphics. Credits and acknowledgments borrowed from other sources and reproduced, with permission, in this textbook appear on appropriate page within text (or on page XX). Copyright © 2003, 2000, 1997, 1993 by Pearson Education, Inc., Upper Saddle River, New Jersey, 07458. All rights reserved. Printed in the United States of America. This publication is protected by Copyright and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permission(s), write to: Rights and Permissions Department. Pearson Education LTD.; Pearson Education Australia PTY, Limited; Pearson Education Singapore, Pte. Ltd; Pearson Education North Asia Ltd; Pearson Education, Canada, Ltd; Pearson Educación de Mexico, S.A. de C.V.; Pearson Education–Japan; Pearson Education Malaysia, Pte. Ltd. 10 9 8 7 6 5 4 3 2 1. ISBN 0-13-066189-9
  • 3. BRIEF CONTENTS. Chapter 1 Introduction; Chapter 2 The Classical Multiple Linear Regression Model; Chapter 3 Least Squares; Chapter 4 Finite-Sample Properties of the Least Squares Estimator; Chapter 5 Large-Sample Properties of the Least Squares and Instrumental Variables Estimators; Chapter 6 Inference and Prediction; Chapter 7 Functional Form and Structural Change; Chapter 8 Specification Analysis and Model Selection; Chapter 9 Nonlinear Regression Models; Chapter 10 Nonspherical Disturbances—The Generalized Regression Model; Chapter 11 Heteroscedasticity; Chapter 12 Serial Correlation; Chapter 13 Models for Panel Data; Chapter 14 Systems of Regression Equations; Chapter 15 Simultaneous-Equations Models; Chapter 16 Estimation Frameworks in Econometrics; Chapter 17 Maximum Likelihood Estimation; Chapter 18 The Generalized Method of Moments; Chapter 19 Models with Lagged Variables; Chapter 20 Time-Series Models; Chapter 21 Models for Discrete Choice; Chapter 22 Limited Dependent Variable and Duration Models; Appendix A Matrix Algebra; Appendix B Probability and Distribution Theory; Appendix C Estimation and Inference; Appendix D Large Sample Distribution Theory
  • 4. Appendix E Computation and Optimization; Appendix F Data Sets Used in Applications; Appendix G Statistical Tables; References; Author Index; Subject Index
  • 5. CONTENTS. CHAPTER 1 Introduction: 1.1 Econometrics; 1.2 Econometric Modeling; 1.3 Data and Methodology; 1.4 Plan of the Book. CHAPTER 2 The Classical Multiple Linear Regression Model: 2.1 Introduction; 2.2 The Linear Regression Model; 2.3 Assumptions of the Classical Linear Regression Model; 2.3.1 Linearity of the Regression Model; 2.3.2 Full Rank; 2.3.3 Regression; 2.3.4 Spherical Disturbances; 2.3.5 Data Generating Process for the Regressors; 2.3.6 Normality; 2.4 Summary and Conclusions. CHAPTER 3 Least Squares: 3.1 Introduction; 3.2 Least Squares Regression; 3.2.1 The Least Squares Coefficient Vector; 3.2.2 Application: An Investment Equation; 3.2.3 Algebraic Aspects of the Least Squares Solution; 3.2.4 Projection; 3.3 Partitioned Regression and Partial Regression; 3.4 Partial Regression and Partial Correlation Coefficients; 3.5 Goodness of Fit and the Analysis of Variance; 3.5.1 The Adjusted R-Squared and a Measure of Fit; 3.5.2 R-Squared and the Constant Term in the Model; 3.5.3 Comparing Models; 3.6 Summary and Conclusions
  • 6. CHAPTER 4 Finite-Sample Properties of the Least Squares Estimator: 4.1 Introduction; 4.2 Motivating Least Squares; 4.2.1 The Population Orthogonality Conditions; 4.2.2 Minimum Mean Squared Error Predictor; 4.2.3 Minimum Variance Linear Unbiased Estimation; 4.3 Unbiased Estimation; 4.4 The Variance of the Least Squares Estimator and the Gauss–Markov Theorem; 4.5 The Implications of Stochastic Regressors; 4.6 Estimating the Variance of the Least Squares Estimator; 4.7 The Normality Assumption and Basic Statistical Inference; 4.7.1 Testing a Hypothesis About a Coefficient; 4.7.2 Confidence Intervals for Parameters; 4.7.3 Confidence Interval for a Linear Combination of Coefficients: The Oaxaca Decomposition; 4.7.4 Testing the Significance of the Regression; 4.7.5 Marginal Distributions of the Test Statistics; 4.8 Finite-Sample Properties of Least Squares; 4.9 Data Problems; 4.9.1 Multicollinearity; 4.9.2 Missing Observations; 4.9.3 Regression Diagnostics and Influential Data Points; 4.10 Summary and Conclusions. CHAPTER 5 Large-Sample Properties of the Least Squares and Instrumental Variables Estimators: 5.1 Introduction; 5.2 Asymptotic Properties of the Least Squares Estimator; 5.2.1 Consistency of the Least Squares Estimator of β; 5.2.2 Asymptotic Normality of the Least Squares Estimator; 5.2.3 Consistency of s² and the Estimator of Asy. Var[b]; 5.2.4 Asymptotic Distribution of a Function of b: The Delta Method; 5.2.5 Asymptotic Efficiency; 5.3 More General Cases; 5.3.1 Heterogeneity in the Distributions of xi; 5.3.2 Dependent Observations; 5.4 Instrumental Variable and Two Stage Least Squares Estimation; 5.5 Hausman’s Specification Test and an Application to Instrumental Variable Estimation
  • 7. 5.6 Measurement Error; 5.6.1 Least Squares Attenuation; 5.6.2 Instrumental Variables Estimation; 5.6.3 Proxy Variables; 5.6.4 Application: Income and Education and a Study of Twins; 5.7 Summary and Conclusions. CHAPTER 6 Inference and Prediction: 6.1 Introduction; 6.2 Restrictions and Nested Models; 6.3 Two Approaches to Testing Hypotheses; 6.3.1 The F Statistic and the Least Squares Discrepancy; 6.3.2 The Restricted Least Squares Estimator; 6.3.3 The Loss of Fit from Restricted Least Squares; 6.4 Nonnormal Disturbances and Large Sample Tests; 6.5 Testing Nonlinear Restrictions; 6.6 Prediction; 6.7 Summary and Conclusions. CHAPTER 7 Functional Form and Structural Change: 7.1 Introduction; 7.2 Using Binary Variables; 7.2.1 Binary Variables in Regression; 7.2.2 Several Categories; 7.2.3 Several Groupings; 7.2.4 Threshold Effects and Categorical Variables; 7.2.5 Spline Regression; 7.3 Nonlinearity in the Variables; 7.3.1 Functional Forms; 7.3.2 Identifying Nonlinearity; 7.3.3 Intrinsic Linearity and Identification; 7.4 Modeling and Testing for a Structural Break; 7.4.1 Different Parameter Vectors; 7.4.2 Insufficient Observations; 7.4.3 Change in a Subset of Coefficients; 7.4.4 Tests of Structural Break with Unequal Variances; 7.5 Tests of Model Stability; 7.5.1 Hansen’s Test; 7.5.2 Recursive Residuals and the CUSUMS Test; 7.5.3 Predictive Test; 7.5.4 Unknown Timing of the Structural Break; 7.6 Summary and Conclusions
  • 8. CHAPTER 8 Specification Analysis and Model Selection: 8.1 Introduction; 8.2 Specification Analysis and Model Building; 8.2.1 Bias Caused by Omission of Relevant Variables; 8.2.2 Pretest Estimation; 8.2.3 Inclusion of Irrelevant Variables; 8.2.4 Model Building—A General to Simple Strategy; 8.3 Choosing Between Nonnested Models; 8.3.1 Testing Nonnested Hypotheses; 8.3.2 An Encompassing Model; 8.3.3 Comprehensive Approach—The J Test; 8.3.4 The Cox Test; 8.4 Model Selection Criteria; 8.5 Summary and Conclusions. CHAPTER 9 Nonlinear Regression Models: 9.1 Introduction; 9.2 Nonlinear Regression Models; 9.2.1 Assumptions of the Nonlinear Regression Model; 9.2.2 The Orthogonality Condition and the Sum of Squares; 9.2.3 The Linearized Regression; 9.2.4 Large Sample Properties of the Nonlinear Least Squares Estimator; 9.2.5 Computing the Nonlinear Least Squares Estimator; 9.3 Applications; 9.3.1 A Nonlinear Consumption Function; 9.3.2 The Box–Cox Transformation; 9.4 Hypothesis Testing and Parametric Restrictions; 9.4.1 Significance Tests for Restrictions: F and Wald Statistics; 9.4.2 Tests Based on the LM Statistic; 9.4.3 A Specification Test for Nonlinear Regressions: The P_E Test; 9.5 Alternative Estimators for Nonlinear Regression Models; 9.5.1 Nonlinear Instrumental Variables Estimation; 9.5.2 Two-Step Nonlinear Least Squares Estimation; 9.5.3 Two-Step Estimation of a Credit Scoring Model; 9.6 Summary and Conclusions. CHAPTER 10 Nonspherical Disturbances—The Generalized Regression Model: 10.1 Introduction; 10.2 Least Squares and Instrumental Variables Estimation; 10.2.1 Finite-Sample Properties of Ordinary Least Squares; 10.2.2 Asymptotic Properties of Least Squares; 10.2.3 Asymptotic Properties of Nonlinear Least Squares
  • 9. 10.2.4 Asymptotic Properties of the Instrumental Variables Estimator; 10.3 Robust Estimation of Asymptotic Covariance Matrices; 10.4 Generalized Method of Moments Estimation; 10.5 Efficient Estimation by Generalized Least Squares; 10.5.1 Generalized Least Squares (GLS); 10.5.2 Feasible Generalized Least Squares; 10.6 Maximum Likelihood Estimation; 10.7 Summary and Conclusions. CHAPTER 11 Heteroscedasticity: 11.1 Introduction; 11.2 Ordinary Least Squares Estimation; 11.2.1 Inefficiency of Least Squares; 11.2.2 The Estimated Covariance Matrix of b; 11.2.3 Estimating the Appropriate Covariance Matrix for Ordinary Least Squares; 11.3 GMM Estimation of the Heteroscedastic Regression Model; 11.4 Testing for Heteroscedasticity; 11.4.1 White’s General Test; 11.4.2 The Goldfeld–Quandt Test; 11.4.3 The Breusch–Pagan/Godfrey LM Test; 11.5 Weighted Least Squares When Ω Is Known; 11.6 Estimation When Ω Contains Unknown Parameters; 11.6.1 Two-Step Estimation; 11.6.2 Maximum Likelihood Estimation; 11.6.3 Model Based Tests for Heteroscedasticity; 11.7 Applications; 11.7.1 Multiplicative Heteroscedasticity; 11.7.2 Groupwise Heteroscedasticity; 11.8 Autoregressive Conditional Heteroscedasticity; 11.8.1 The ARCH(1) Model; 11.8.2 ARCH(q), ARCH-in-Mean and Generalized ARCH Models; 11.8.3 Maximum Likelihood Estimation of the GARCH Model; 11.8.4 Testing for GARCH Effects; 11.8.5 Pseudo-Maximum Likelihood Estimation; 11.9 Summary and Conclusions. CHAPTER 12 Serial Correlation: 12.1 Introduction; 12.2 The Analysis of Time-Series Data; 12.3 Disturbance Processes
  • 10. 12.3.1 Characteristics of Disturbance Processes; 12.3.2 AR(1) Disturbances; 12.4 Some Asymptotic Results for Analyzing Time Series Data; 12.4.1 Convergence of Moments—The Ergodic Theorem; 12.4.2 Convergence to Normality—A Central Limit Theorem; 12.5 Least Squares Estimation; 12.5.1 Asymptotic Properties of Least Squares; 12.5.2 Estimating the Variance of the Least Squares Estimator; 12.6 GMM Estimation; 12.7 Testing for Autocorrelation; 12.7.1 Lagrange Multiplier Test; 12.7.2 Box and Pierce’s Test and Ljung’s Refinement; 12.7.3 The Durbin–Watson Test; 12.7.4 Testing in the Presence of a Lagged Dependent Variable; 12.7.5 Summary of Testing Procedures; 12.8 Efficient Estimation When Ω Is Known; 12.9 Estimation When Ω Is Unknown; 12.9.1 AR(1) Disturbances; 12.9.2 AR(2) Disturbances; 12.9.3 Application: Estimation of a Model with Autocorrelation; 12.9.4 Estimation with a Lagged Dependent Variable; 12.10 Common Factors; 12.11 Forecasting in the Presence of Autocorrelation; 12.12 Summary and Conclusions. CHAPTER 13 Models for Panel Data: 13.1 Introduction; 13.2 Panel Data Models; 13.3 Fixed Effects; 13.3.1 Testing the Significance of the Group Effects; 13.3.2 The Within- and Between-Groups Estimators; 13.3.3 Fixed Time and Group Effects; 13.3.4 Unbalanced Panels and Fixed Effects; 13.4 Random Effects; 13.4.1 Generalized Least Squares; 13.4.2 Feasible Generalized Least Squares When Σ Is Unknown; 13.4.3 Testing for Random Effects; 13.4.4 Hausman’s Specification Test for the Random Effects Model; 13.5 Instrumental Variables Estimation of the Random Effects Model; 13.6 GMM Estimation of Dynamic Panel Data Models; 13.7 Nonspherical Disturbances and Robust Covariance Estimation; 13.7.1 Robust Estimation of the Fixed Effects Model
  • 11. 13.7.2 Heteroscedasticity in the Random Effects Model; 13.7.3 Autocorrelation in Panel Data Models; 13.8 Random Coefficients Models; 13.9 Covariance Structures for Pooled Time-Series Cross-Sectional Data; 13.9.1 Generalized Least Squares Estimation; 13.9.2 Feasible GLS Estimation; 13.9.3 Heteroscedasticity and the Classical Model; 13.9.4 Specification Tests; 13.9.5 Autocorrelation; 13.9.6 Maximum Likelihood Estimation; 13.9.7 Application to Grunfeld’s Investment Data; 13.9.8 Summary; 13.10 Summary and Conclusions. CHAPTER 14 Systems of Regression Equations: 14.1 Introduction; 14.2 The Seemingly Unrelated Regressions Model; 14.2.1 Generalized Least Squares; 14.2.2 Seemingly Unrelated Regressions with Identical Regressors; 14.2.3 Feasible Generalized Least Squares; 14.2.4 Maximum Likelihood Estimation; 14.2.5 An Application from Financial Econometrics: The Capital Asset Pricing Model; 14.2.6 Maximum Likelihood Estimation of the Seemingly Unrelated Regressions Model with a Block of Zeros in the Coefficient Matrix; 14.2.7 Autocorrelation and Heteroscedasticity; 14.3 Systems of Demand Equations: Singular Systems; 14.3.1 Cobb–Douglas Cost Function; 14.3.2 Flexible Functional Forms: The Translog Cost Function; 14.4 Nonlinear Systems and GMM Estimation; 14.4.1 GLS Estimation; 14.4.2 Maximum Likelihood Estimation; 14.4.3 GMM Estimation; 14.5 Summary and Conclusions. CHAPTER 15 Simultaneous-Equations Models: 15.1 Introduction; 15.2 Fundamental Issues in Simultaneous-Equations Models; 15.2.1 Illustrative Systems of Equations; 15.2.2 Endogeneity and Causality; 15.2.3 A General Notation for Linear Simultaneous Equations Models; 15.3 The Problem of Identification
  • 12. 15.3.1 The Rank and Order Conditions for Identification; 15.3.2 Identification Through Other Nonsample Information; 15.3.3 Identification Through Covariance Restrictions—The Fully Recursive Model; 15.4 Methods of Estimation; 15.5 Single Equation: Limited Information Estimation Methods; 15.5.1 Ordinary Least Squares; 15.5.2 Estimation by Instrumental Variables; 15.5.3 Two-Stage Least Squares; 15.5.4 GMM Estimation; 15.5.5 Limited Information Maximum Likelihood and the k Class of Estimators; 15.5.6 Two-Stage Least Squares in Models That Are Nonlinear in Variables; 15.6 System Methods of Estimation; 15.6.1 Three-Stage Least Squares; 15.6.2 Full-Information Maximum Likelihood; 15.6.3 GMM Estimation; 15.6.4 Recursive Systems and Exactly Identified Equations; 15.7 Comparison of Methods—Klein’s Model I; 15.8 Specification Tests; 15.9 Properties of Dynamic Models; 15.9.1 Dynamic Models and Their Multipliers; 15.9.2 Stability; 15.9.3 Adjustment to Equilibrium; 15.10 Summary and Conclusions. CHAPTER 16 Estimation Frameworks in Econometrics: 16.1 Introduction; 16.2 Parametric Estimation and Inference; 16.2.1 Classical Likelihood Based Estimation; 16.2.2 Bayesian Estimation; 16.2.2.a Bayesian Analysis of the Classical Regression Model; 16.2.2.b Point Estimation; 16.2.2.c Interval Estimation; 16.2.2.d Estimation with an Informative Prior Density; 16.2.2.e Hypothesis Testing; 16.2.3 Using Bayes Theorem in a Classical Estimation Problem: The Latent Class Model; 16.2.4 Hierarchical Bayes Estimation of a Random Parameters Model by Markov Chain Monte Carlo Simulation; 16.3 Semiparametric Estimation; 16.3.1 GMM Estimation in Econometrics; 16.3.2 Least Absolute Deviations Estimation
  • 13. 16.3.3 Partially Linear Regression; 16.3.4 Kernel Density Methods; 16.4 Nonparametric Estimation; 16.4.1 Kernel Density Estimation; 16.4.2 Nonparametric Regression; 16.5 Properties of Estimators; 16.5.1 Statistical Properties of Estimators; 16.5.2 Extremum Estimators; 16.5.3 Assumptions for Asymptotic Properties of Extremum Estimators; 16.5.4 Asymptotic Properties of Estimators; 16.5.5 Testing Hypotheses; 16.6 Summary and Conclusions. CHAPTER 17 Maximum Likelihood Estimation: 17.1 Introduction; 17.2 The Likelihood Function and Identification of the Parameters; 17.3 Efficient Estimation: The Principle of Maximum Likelihood; 17.4 Properties of Maximum Likelihood Estimators; 17.4.1 Regularity Conditions; 17.4.2 Properties of Regular Densities; 17.4.3 The Likelihood Equation; 17.4.4 The Information Matrix Equality; 17.4.5 Asymptotic Properties of the Maximum Likelihood Estimator; 17.4.5.a Consistency; 17.4.5.b Asymptotic Normality; 17.4.5.c Asymptotic Efficiency; 17.4.5.d Invariance; 17.4.5.e Conclusion; 17.4.6 Estimating the Asymptotic Variance of the Maximum Likelihood Estimator; 17.4.7 Conditional Likelihoods and Econometric Models; 17.5 Three Asymptotically Equivalent Test Procedures; 17.5.1 The Likelihood Ratio Test; 17.5.2 The Wald Test; 17.5.3 The Lagrange Multiplier Test; 17.5.4 An Application of the Likelihood Based Test Procedures; 17.6 Applications of Maximum Likelihood Estimation; 17.6.1 The Normal Linear Regression Model; 17.6.2 Maximum Likelihood Estimation of Nonlinear Regression Models; 17.6.3 Nonnormal Disturbances—The Stochastic Frontier Model; 17.6.4 Conditional Moment Tests of Specification
  • 14. 17.7 Two-Step Maximum Likelihood Estimation; 17.8 Maximum Simulated Likelihood Estimation; 17.9 Pseudo-Maximum Likelihood Estimation and Robust Asymptotic Covariance Matrices; 17.10 Summary and Conclusions. CHAPTER 18 The Generalized Method of Moments: 18.1 Introduction; 18.2 Consistent Estimation: The Method of Moments; 18.2.1 Random Sampling and Estimating the Parameters of Distributions; 18.2.2 Asymptotic Properties of the Method of Moments Estimator; 18.2.3 Summary—The Method of Moments; 18.3 The Generalized Method of Moments (GMM) Estimator; 18.3.1 Estimation Based on Orthogonality Conditions; 18.3.2 Generalizing the Method of Moments; 18.3.3 Properties of the GMM Estimator; 18.3.4 GMM Estimation of Some Specific Econometric Models; 18.4 Testing Hypotheses in the GMM Framework; 18.4.1 Testing the Validity of the Moment Restrictions; 18.4.2 GMM Counterparts to the Wald, LM, and LR Tests; 18.5 Application: GMM Estimation of a Dynamic Panel Data Model of Local Government Expenditures; 18.6 Summary and Conclusions. CHAPTER 19 Models with Lagged Variables: 19.1 Introduction; 19.2 Dynamic Regression Models; 19.2.1 Lagged Effects in a Dynamic Model; 19.2.2 The Lag and Difference Operators; 19.2.3 Specification Search for the Lag Length; 19.3 Simple Distributed Lag Models; 19.3.1 Finite Distributed Lag Models; 19.3.2 An Infinite Lag Model: The Geometric Lag Model; 19.4 Autoregressive Distributed Lag Models; 19.4.1 Estimation of the ARDL Model; 19.4.2 Computation of the Lag Weights in the ARDL Model; 19.4.3 Stability of a Dynamic Equation; 19.4.4 Forecasting; 19.5 Methodological Issues in the Analysis of Dynamic Models; 19.5.1 An Error Correction Model; 19.5.2 Autocorrelation
  • 15. 19.5.3 Specification Analysis; 19.5.4 Common Factor Restrictions; 19.6 Vector Autoregressions; 19.6.1 Model Forms; 19.6.2 Estimation; 19.6.3 Testing Procedures; 19.6.4 Exogeneity; 19.6.5 Testing for Granger Causality; 19.6.6 Impulse Response Functions; 19.6.7 Structural VARs; 19.6.8 Application: Policy Analysis with a VAR; 19.6.9 VARs in Microeconomics; 19.7 Summary and Conclusions. CHAPTER 20 Time-Series Models: 20.1 Introduction; 20.2 Stationary Stochastic Processes; 20.2.1 Autoregressive Moving-Average Processes; 20.2.2 Stationarity and Invertibility; 20.2.3 Autocorrelations of a Stationary Stochastic Process; 20.2.4 Partial Autocorrelations of a Stationary Stochastic Process; 20.2.5 Modeling Univariate Time Series; 20.2.6 Estimation of the Parameters of a Univariate Time Series; 20.2.7 The Frequency Domain; 20.2.7.a Theoretical Results; 20.2.7.b Empirical Counterparts; 20.3 Nonstationary Processes and Unit Roots; 20.3.1 Integrated Processes and Differencing; 20.3.2 Random Walks, Trends, and Spurious Regressions; 20.3.3 Tests for Unit Roots in Economic Data; 20.3.4 The Dickey–Fuller Tests; 20.3.5 Long Memory Models; 20.4 Cointegration; 20.4.1 Common Trends; 20.4.2 Error Correction and VAR Representations; 20.4.3 Testing for Cointegration; 20.4.4 Estimating Cointegration Relationships; 20.4.5 Application: German Money Demand; 20.4.5.a Cointegration Analysis and a Long Run Theoretical Model; 20.4.5.b Testing for Model Instability; 20.5 Summary and Conclusions
  • 16. CHAPTER 21 Models for Discrete Choice: 21.1 Introduction; 21.2 Discrete Choice Models; 21.3 Models for Binary Choice; 21.3.1 The Regression Approach; 21.3.2 Latent Regression—Index Function Models; 21.3.3 Random Utility Models; 21.4 Estimation and Inference in Binary Choice Models; 21.4.1 Robust Covariance Matrix Estimation; 21.4.2 Marginal Effects; 21.4.3 Hypothesis Tests; 21.4.4 Specification Tests for Binary Choice Models; 21.4.4.a Omitted Variables; 21.4.4.b Heteroscedasticity; 21.4.4.c A Specification Test for Nonnested Models—Testing for the Distribution; 21.4.5 Measuring Goodness of Fit; 21.4.6 Analysis of Proportions Data; 21.5 Extensions of the Binary Choice Model; 21.5.1 Random and Fixed Effects Models for Panel Data; 21.5.1.a Random Effects Models; 21.5.1.b Fixed Effects Models; 21.5.2 Semiparametric Analysis; 21.5.3 The Maximum Score Estimator (MSCORE); 21.5.4 Semiparametric Estimation; 21.5.5 A Kernel Estimator for a Nonparametric Regression Function; 21.5.6 Dynamic Binary Choice Models; 21.6 Bivariate and Multivariate Probit Models; 21.6.1 Maximum Likelihood Estimation; 21.6.2 Testing for Zero Correlation; 21.6.3 Marginal Effects; 21.6.4 Sample Selection; 21.6.5 A Multivariate Probit Model; 21.6.6 Application: Gender Economics Courses in Liberal Arts Colleges; 21.7 Logit Models for Multiple Choices; 21.7.1 The Multinomial Logit Model; 21.7.2 The Conditional Logit Model; 21.7.3 The Independence from Irrelevant Alternatives; 21.7.4 Nested Logit Models; 21.7.5 A Heteroscedastic Logit Model; 21.7.6 Multinomial Models Based on the Normal Distribution; 21.7.7 A Random Parameters Model
  • 17. 21.7.8 Application: Conditional Logit Model for Travel Mode Choice; 21.8 Ordered Data; 21.9 Models for Count Data; 21.9.1 Measuring Goodness of Fit; 21.9.2 Testing for Overdispersion; 21.9.3 Heterogeneity and the Negative Binomial Regression Model; 21.9.4 Application: The Poisson Regression Model; 21.9.5 Poisson Models for Panel Data; 21.9.6 Hurdle and Zero-Altered Poisson Models; 21.10 Summary and Conclusions. CHAPTER 22 Limited Dependent Variable and Duration Models: 22.1 Introduction; 22.2 Truncation; 22.2.1 Truncated Distributions; 22.2.2 Moments of Truncated Distributions; 22.2.3 The Truncated Regression Model; 22.3 Censored Data; 22.3.1 The Censored Normal Distribution; 22.3.2 The Censored Regression (Tobit) Model; 22.3.3 Estimation; 22.3.4 Some Issues in Specification; 22.3.4.a Heteroscedasticity; 22.3.4.b Misspecification of Prob[y* < 0]; 22.3.4.c Nonnormality; 22.3.4.d Conditional Moment Tests; 22.3.5 Censoring and Truncation in Models for Counts; 22.3.6 Application: Censoring in the Tobit and Poisson Regression Models; 22.4 The Sample Selection Model; 22.4.1 Incidental Truncation in a Bivariate Distribution; 22.4.2 Regression in a Model of Selection; 22.4.3 Estimation; 22.4.4 Treatment Effects; 22.4.5 The Normality Assumption; 22.4.6 Selection in Qualitative Response Models; 22.5 Models for Duration Data; 22.5.1 Duration Data; 22.5.2 A Regression-Like Approach: Parametric Models of Duration; 22.5.2.a Theoretical Background; 22.5.2.b Models of the Hazard Function; 22.5.2.c Maximum Likelihood Estimation
  • 18. 22.5.2.d Exogenous Variables; 22.5.2.e Heterogeneity; 22.5.3 Other Approaches; 22.6 Summary and Conclusions. APPENDIX A Matrix Algebra: A.1 Terminology; A.2 Algebraic Manipulation of Matrices; A.2.1 Equality of Matrices; A.2.2 Transposition; A.2.3 Matrix Addition; A.2.4 Vector Multiplication; A.2.5 A Notation for Rows and Columns of a Matrix; A.2.6 Matrix Multiplication and Scalar Multiplication; A.2.7 Sums of Values; A.2.8 A Useful Idempotent Matrix; A.3 Geometry of Matrices; A.3.1 Vector Spaces; A.3.2 Linear Combinations of Vectors and Basis Vectors; A.3.3 Linear Dependence; A.3.4 Subspaces; A.3.5 Rank of a Matrix; A.3.6 Determinant of a Matrix; A.3.7 A Least Squares Problem; A.4 Solution of a System of Linear Equations; A.4.1 Systems of Linear Equations; A.4.2 Inverse Matrices; A.4.3 Nonhomogeneous Systems of Equations; A.4.4 Solving the Least Squares Problem; A.5 Partitioned Matrices; A.5.1 Addition and Multiplication of Partitioned Matrices; A.5.2 Determinants of Partitioned Matrices; A.5.3 Inverses of Partitioned Matrices; A.5.4 Deviations from Means; A.5.5 Kronecker Products; A.6 Characteristic Roots and Vectors; A.6.1 The Characteristic Equation; A.6.2 Characteristic Vectors; A.6.3 General Results for Characteristic Roots and Vectors; A.6.4 Diagonalization and Spectral Decomposition of a Matrix; A.6.5 Rank of a Matrix; A.6.6 Condition Number of a Matrix; A.6.7 Trace of a Matrix; A.6.8 Determinant of a Matrix; A.6.9 Powers of a Matrix
  • 19. A.6.10 Idempotent Matrices; A.6.11 Factoring a Matrix; A.6.12 The Generalized Inverse of a Matrix; A.7 Quadratic Forms and Definite Matrices; A.7.1 Nonnegative Definite Matrices; A.7.2 Idempotent Quadratic Forms; A.7.3 Comparing Matrices; A.8 Calculus and Matrix Algebra; A.8.1 Differentiation and the Taylor Series; A.8.2 Optimization; A.8.3 Constrained Optimization; A.8.4 Transformations. APPENDIX B Probability and Distribution Theory: B.1 Introduction; B.2 Random Variables; B.2.1 Probability Distributions; B.2.2 Cumulative Distribution Function; B.3 Expectations of a Random Variable; B.4 Some Specific Probability Distributions; B.4.1 The Normal Distribution; B.4.2 The Chi-Squared, t, and F Distributions; B.4.3 Distributions With Large Degrees of Freedom; B.4.4 Size Distributions: The Lognormal Distribution; B.4.5 The Gamma and Exponential Distributions; B.4.6 The Beta Distribution; B.4.7 The Logistic Distribution; B.4.8 Discrete Random Variables; B.5 The Distribution of a Function of a Random Variable; B.6 Representations of a Probability Distribution; B.7 Joint Distributions; B.7.1 Marginal Distributions; B.7.2 Expectations in a Joint Distribution; B.7.3 Covariance and Correlation; B.7.4 Distribution of a Function of Bivariate Random Variables; B.8 Conditioning in a Bivariate Distribution; B.8.1 Regression: The Conditional Mean; B.8.2 Conditional Variance; B.8.3 Relationships Among Marginal and Conditional Moments; B.8.4 The Analysis of Variance; B.9 The Bivariate Normal Distribution; B.10 Multivariate Distributions; B.10.1 Moments
  • 20. B.10.2 Sets of Linear Functions; B.10.3 Nonlinear Functions; B.11 The Multivariate Normal Distribution; B.11.1 Marginal and Conditional Normal Distributions; B.11.2 The Classical Normal Linear Regression Model; B.11.3 Linear Functions of a Normal Vector; B.11.4 Quadratic Forms in a Standard Normal Vector; B.11.5 The F Distribution; B.11.6 A Full Rank Quadratic Form; B.11.7 Independence of a Linear and a Quadratic Form. APPENDIX C Estimation and Inference: C.1 Introduction; C.2 Samples and Random Sampling; C.3 Descriptive Statistics; C.4 Statistics as Estimators—Sampling Distributions; C.5 Point Estimation of Parameters; C.5.1 Estimation in a Finite Sample; C.5.2 Efficient Unbiased Estimation; C.6 Interval Estimation; C.7 Hypothesis Testing; C.7.1 Classical Testing Procedures; C.7.2 Tests Based on Confidence Intervals; C.7.3 Specification Tests. APPENDIX D Large Sample Distribution Theory: D.1 Introduction; D.2 Large-Sample Distribution Theory; D.2.1 Convergence in Probability; D.2.2 Other Forms of Convergence and Laws of Large Numbers; D.2.3 Convergence of Functions; D.2.4 Convergence to a Random Variable; D.2.5 Convergence in Distribution: Limiting Distributions; D.2.6 Central Limit Theorems; D.2.7 The Delta Method; D.3 Asymptotic Distributions; D.3.1 Asymptotic Distribution of a Nonlinear Function; D.3.2 Asymptotic Expectations; D.4 Sequences and the Order of a Sequence. APPENDIX E Computation and Optimization: E.1 Introduction; E.2 Data Input and Generation; E.2.1 Generating Pseudo-Random Numbers
  • 21. Greene-50240 gree50240˙FM July 10, 2002 12:51 Contents xxv E.2.2 Sampling from a Standard Uniform Population 921 E.2.3 Sampling from Continuous Distributions 921 E.2.4 Sampling from a Multivariate Normal Population 922 E.2.5 Sampling from a Discrete Population 922 E.2.6 The Gibbs Sampler 922 E.3 Monte Carlo Studies 923 E.4 Bootstrapping and the Jackknife 924 E.5 Computation in Econometrics 925 E.5.1 Computing Integrals 926 E.5.2 The Standard Normal Cumulative Distribution Function 926 E.5.3 The Gamma and Related Functions 927 E.5.4 Approximating Integrals by Quadrature 928 E.5.5 Monte Carlo Integration 929 E.5.6 Multivariate Normal Probabilities and Simulated Moments 931 E.5.7 Computing Derivatives 933 E.6 Optimization 933 E.6.1 Algorithms 935 E.6.2 Gradient Methods 935 E.6.3 Aspects of Maximum Likelihood Estimation 939 E.6.4 Optimization with Constraints 941 E.6.5 Some Practical Considerations 942 E.6.6 Examples 943 APPENDIX F Data Sets Used in Applications 946 APPENDIX G Statistical Tables 953 References 959 Author Index 000 Subject Index 000
  • 22. PREFACE

1. THE FIFTH EDITION OF ECONOMETRIC ANALYSIS

Econometric Analysis is intended for a one-year graduate course in econometrics for social scientists. The prerequisites for this course should include calculus, mathematical statistics, and an introduction to econometrics at the level of, say, Gujarati’s Basic Econometrics (McGraw-Hill, 1995) or Wooldridge’s Introductory Econometrics: A Modern Approach [South-Western (2000)]. Self-contained (for our purposes) summaries of the matrix algebra, mathematical statistics, and statistical theory used later in the book are given in Appendices A through D. Appendix E contains a description of numerical methods that will be useful to practicing econometricians. The formal presentation of econometrics begins with discussion of a fundamental pillar, the linear multiple regression model, in Chapters 2 through 8. Chapters 9 through 15 present familiar extensions of the single linear equation model, including nonlinear regression, panel data models, the generalized regression model, and systems of equations. The linear model is usually not the sole technique used in most of the contemporary literature. In view of this, the (expanding) second half of this book is devoted to topics that will extend the linear regression model in many directions. Chapters 16 through 18 present the techniques and underlying theory of estimation in econometrics, including GMM and maximum likelihood estimation methods and simulation based techniques. We end in the last four chapters, 19 through 22, with discussions of current topics in applied econometrics, including time-series analysis and the analysis of discrete choice and limited dependent variable models.

This book has two objectives. The first is to introduce students to applied econometrics, including basic techniques in regression analysis and some of the rich variety of models that are used when the linear model proves inadequate or inappropriate. The second is to present students with sufficient theoretical background that they will recognize new variants of the models learned about here as merely natural extensions that fit within a common body of principles. Thus, I have spent what might seem to be a large amount of effort explaining the mechanics of GMM estimation, nonlinear least squares, maximum likelihood estimation, and GARCH models. To meet the second objective, this book also contains a fair amount of theoretical material, such as that on maximum likelihood estimation and on asymptotic results for regression models. Modern software has made complicated modeling very easy to do, and an understanding of the underlying theory is important.

I had several purposes in undertaking this revision. As in the past, readers continue to send me interesting ideas for my “next edition.” It is impossible to use them all, of
  • 23. course. Because the five volumes of the Handbook of Econometrics and two of the Handbook of Applied Econometrics already run to over 4,000 pages, it is also unnecessary. Nonetheless, this revision is appropriate for several reasons. First, there are new and interesting developments in the field, particularly in the areas of microeconometrics (panel data, models for discrete choice) and, of course, in time series, which continues its rapid development. Second, I have taken the opportunity to continue fine-tuning the text as the experience and shared wisdom of my readers accumulates in my files. For this revision, that adjustment has entailed a substantial rearrangement of the material; the main purpose of that was to allow me to add the new material in a more compact and orderly way than I could have with the table of contents in the 4th edition. The literature in econometrics has continued to evolve, and my third objective is to grow with it. This purpose is inherently difficult to accomplish in a textbook. Most of the literature is written by professionals for other professionals, and this textbook is written for students who are in the early stages of their training. But I do hope to provide a bridge to that literature, both theoretical and applied.

This book is a broad survey of the field of econometrics. This field grows continually, and such an effort becomes increasingly difficult. (A partial list of journals devoted at least in part, if not completely, to econometrics now includes the Journal of Applied Econometrics, Journal of Econometrics, Econometric Theory, Econometric Reviews, Journal of Business and Economic Statistics, Empirical Economics, and Econometrica.) Still, my view has always been that the serious student of the field must start somewhere, and one can successfully seek that objective in a single textbook. This text attempts to survey, at an entry level, enough of the fields in econometrics that a student can comfortably move from here to practice or more advanced study in one or more specialized areas. At the same time, I have tried to present the material in sufficient generality that the reader is also able to appreciate the important common foundation of all these fields and to use the tools that they all employ.

There are now quite a few recently published texts in econometrics. Several have gathered, in compact, elegant treatises, the increasingly advanced and advancing theoretical background of econometrics. Others, such as this book, focus more attention on applications of econometrics. One feature that distinguishes this work from its predecessors is its greater emphasis on nonlinear models. [Davidson and MacKinnon (1993) is a noteworthy, but more advanced, exception.] Computer software now in wide use has made estimation of nonlinear models as routine as estimation of linear ones, and the recent literature reflects that progression. My purpose is to provide a textbook treatment that is in line with current practice. The book concludes with four lengthy chapters on time-series analysis, discrete choice models, and limited dependent variable models. These nonlinear models are now the staples of the applied econometrics literature.

This book also contains a fair amount of material that will extend beyond many first courses in econometrics, including, perhaps, the aforementioned chapters on limited dependent variables, the section in Chapter 22 on duration models, and some of the discussions of time series and panel data models. Once again, I have included these in the hope of providing a bridge to the professional literature in these areas.

I have had one overriding purpose that has motivated all five editions of this work. For the vast majority of readers of books such as this, whose ambition is to use, not develop, econometrics, I believe that it is simply not sufficient to recite the theory of estimation, hypothesis testing, and econometric analysis. Understanding the often subtle
  • 24. background theory is extremely important. But, at the end of the day, my purpose in writing this work, and for my continuing efforts to update it in this now fifth edition, is to show readers how to do econometric analysis. I unabashedly accept the unflattering assessment of a correspondent who once likened this book to a “user’s guide to econometrics.”

2. SOFTWARE AND DATA

There are many computer programs that are widely used for the computations described in this book. All were written by econometricians or statisticians, and in general, all are regularly updated to incorporate new developments in applied econometrics. A sampling of the most widely used packages and Internet home pages where you can find information about them are:

    E-Views (QMS, Irvine, Calif.)
    Gauss (Aptech Systems, Kent, Wash.)
    LIMDEP (Econometric Software, Plainview, N.Y.)
    RATS (Estima, Evanston, Ill.)
    SAS (SAS, Cary, N.C.)
    Shazam (Ken White, UBC, Vancouver, B.C.)
    Stata (Stata, College Station, Tex.)
    TSP (TSP International, Stanford, Calif.)

Programs vary in size, complexity, cost, the amount of programming required of the user, and so on. Journals such as The American Statistician, The Journal of Applied Econometrics, and The Journal of Economic Surveys regularly publish reviews of individual packages and comparative surveys of packages, usually with reference to particular functionality such as panel data analysis or forecasting.

With only a few exceptions, the computations described in this book can be carried out with any of these packages. We hesitate to link this text to any of them in particular. We have placed for general access a customized version of LIMDEP, which was also written by the author, on the website for this text, ∼wgreene/Text/econometricanalysis.htm. LIMDEP programs used for many of the computations are posted on the sites as well.

The data sets used in the examples are also on the website. Throughout the text, these data sets are referred to as “Table Fn.m,” for example, Table F4.1. The F refers to Appendix F at the back of the text, which contains descriptions of the data sets. The actual data are posted on the website with the other supplementary materials for the text. (The data sets are also replicated in the system format of most of the commonly used econometrics computer programs, including, in addition to LIMDEP, SAS, TSP, SPSS, E-Views, and Stata, so that you can easily import them into whatever program you might be using.)

I should also note that there are now thousands of interesting websites containing software, data sets, papers, and commentary on econometrics. It would be hopeless to attempt any kind of a survey here. But I do note one that is particularly agreeably structured and well targeted for readers of this book, the data archive for the
  • 25. Journal of Applied Econometrics. This journal publishes many papers that are precisely at the right level for readers of this text. They have archived all the nonconfidential data sets used in their publications since 1994. This useful archive can be found at

3. ACKNOWLEDGEMENTS

It is a pleasure to express my appreciation to those who have influenced this work. I am grateful to Arthur Goldberger and Arnold Zellner for their encouragement, guidance, and always interesting correspondence. Dennis Aigner and Laurits Christensen were also influential in shaping my views on econometrics. Some collaborators to the earlier editions whose contributions remain in this one include Aline Quester, David Hensher, and Donald Waldman. The number of students and colleagues whose suggestions have helped to produce what you find here is far too large to allow me to thank them all individually. I would like to acknowledge the many reviewers of my work whose careful reading has vastly improved the book: Badi Baltagi, University of Houston; Neal Beck, University of California at San Diego; Diane Belleville, Columbia University; Anil Bera, University of Illinois; John Burkett, University of Rhode Island; Leonard Carlson, Emory University; Frank Chaloupka, City University of New York; Chris Cornwell, University of Georgia; Mitali Das, Columbia University; Craig Depken II, University of Texas at Arlington; Edward Dwyer, Clemson University; Michael Ellis, Wesleyan University; Martin Evans, New York University; Ed Greenberg, Washington University at St. Louis; Miguel Herce, University of North Carolina; K. Rao Kadiyala, Purdue University; Tong Li, Indiana University; Lubomir Litov, New York University; William Lott, University of Connecticut; Edward Mathis, Villanova University; Mary McGarvey, University of Nebraska-Lincoln; Ed Melnick, New York University; Thad Mirer, State University of New York at Albany; Paul Ruud, University of California at Berkeley; Sherrie Rhine, Chicago Federal Reserve Board; Terry G. Seaks, University of North Carolina at Greensboro; Donald Snyder, California State University at Los Angeles; Steven Stern, University of Virginia; Houston Stokes, University of Illinois at Chicago; Dimitrios Thomakos, Florida International University; Paul Wachtel, New York University; Mark Watson, Harvard University; and Kenneth West, University of Wisconsin. My numerous discussions with B. D. McCullough have improved Appendix E and at the same time increased my appreciation for numerical analysis. I am especially grateful to Jan Kiviet of the University of Amsterdam, who subjected my third edition to a microscopic examination and provided literally scores of suggestions, virtually all of which appear herein. Chapters 19 and 20 have also benefited from previous reviews by Frank Diebold, B. D. McCullough, Mary McGarvey, and Nagesh Revankar. I would also like to thank Rod Banister, Gladys Soto, Cindy Regan, Mike Reynolds, Marie McHale, Lisa Amato, and Torie Anderson at Prentice Hall for their contributions to the completion of this book. As always, I owe the greatest debt to my wife, Lynne, and to my daughters, Lesley, Allison, Elizabeth, and Julianna.

William H. Greene
  • 26. 1 INTRODUCTION

1.1 ECONOMETRICS

In the first issue of Econometrica, the Econometric Society stated that

    its main object shall be to promote studies that aim at a unification of the theoretical-quantitative and the empirical-quantitative approach to economic problems and that are penetrated by constructive and rigorous thinking similar to that which has come to dominate the natural sciences.

    But there are several aspects of the quantitative approach to economics, and no single one of these aspects taken by itself, should be confounded with econometrics. Thus, econometrics is by no means the same as economic statistics. Nor is it identical with what we call general economic theory, although a considerable portion of this theory has a definitely quantitative character. Nor should econometrics be taken as synonomous [sic] with the application of mathematics to economics. Experience has shown that each of these three viewpoints, that of statistics, economic theory, and mathematics, is a necessary, but not by itself a sufficient, condition for a real understanding of the quantitative relations in modern economic life. It is the unification of all three that is powerful. And it is this unification that constitutes econometrics.

Frisch (1933) and his society responded to an unprecedented accumulation of statistical information. They saw a need to establish a body of principles that could organize what would otherwise become a bewildering mass of data. Neither the pillars nor the objectives of econometrics have changed in the years since this editorial appeared. Econometrics is the field of economics that concerns itself with the application of mathematical statistics and the tools of statistical inference to the empirical measurement of relationships postulated by economic theory.

1.2 ECONOMETRIC MODELING

Econometric analysis will usually begin with a statement of a theoretical proposition. Consider, for example, a canonical application:

Example 1.1 Keynes’s Consumption Function

From Keynes’s (1936) General Theory of Employment, Interest and Money:

    We shall therefore define what we shall call the propensity to consume as the functional relationship f between X, a given level of income and C, the expenditure on consumption out of the level of income, so that C = f(X).

    The amount that the community spends on consumption depends (i) partly on the amount of its income, (ii) partly on other objective attendant circumstances, and
  • 27. (iii) partly on the subjective needs and the psychological propensities and habits of the individuals composing it. The fundamental psychological law upon which we are entitled to depend with great confidence, both a priori from our knowledge of human nature and from the detailed facts of experience, is that men are disposed, as a rule and on the average, to increase their consumption as their income increases, but not by as much as the increase in their income.¹ That is, . . . dC/dX is positive and less than unity.

    But, apart from short period changes in the level of income, it is also obvious that a higher absolute level of income will tend as a rule to widen the gap between income and consumption. . . . These reasons will lead, as a rule, to a greater proportion of income being saved as real income increases.

The theory asserts a relationship between consumption and income, C = f(X), and claims in the third paragraph that the marginal propensity to consume (MPC), dC/dX, is between 0 and 1. The final paragraph asserts that the average propensity to consume (APC), C/X, falls as income rises, or d(C/X)/dX = (MPC − APC)/X < 0. It follows that MPC < APC. The most common formulation of the consumption function is a linear relationship, C = α + βX, that satisfies Keynes’s “laws” if β lies between zero and one and if α is greater than zero.

These theoretical propositions provide the basis for an econometric study. Given an appropriate data set, we could investigate whether the theory appears to be consistent with the observed “facts.” For example, we could see whether the linear specification appears to be a satisfactory description of the relationship between consumption and income, and, if so, whether α is positive and β is between zero and one.
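The step from a falling APC to MPC < APC can be written out with the quotient rule. The following is a worked sketch using only the symbols already defined in the example (C, X, α, β), and it assumes X > 0:

```latex
\frac{d(C/X)}{dX}
  = \frac{X\,\dfrac{dC}{dX} - C}{X^{2}}
  = \frac{1}{X}\left(\frac{dC}{dX} - \frac{C}{X}\right)
  = \frac{\mathrm{MPC} - \mathrm{APC}}{X}.
% With X > 0, the derivative is negative exactly when MPC < APC.
%
% Linear case, C = \alpha + \beta X:
\mathrm{MPC} = \beta, \qquad
\mathrm{APC} = \frac{\alpha}{X} + \beta, \qquad
\frac{d(C/X)}{dX} = -\frac{\alpha}{X^{2}} < 0
\iff \alpha > 0 .
```

In the linear case, then, the requirement that α be greater than zero is what makes the average propensity to consume fall as income rises, while 0 < β < 1 delivers the condition on the marginal propensity.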
Some issues that might be studied are (1) whether this relationship is stable through time or whether the parameters of the relationship change from one generation to the next (a change in the average propensity to save, 1 − APC, might represent a fundamental change in the behavior of consumers in the economy); (2) whether there are systematic differences in the relationship across different countries, and, if so, what explains these differences; and (3) whether there are other factors that would improve the ability of the model to explain the relationship between consumption and income. For example, Figure 1.1 presents aggregate consumption and personal income in constant dollars for the U.S. for the 10 years of 1970–1979. (See Appendix Table F1.1.) Apparently, at least superficially, the data (the facts) are consistent with the theory. The relationship appears to be linear, albeit only approximately; the intercept of a line that lies close to most of the points is positive, and the slope is less than one, although not by much.

Economic theories such as Keynes’s are typically crisp and unambiguous. Models of demand, production, and aggregate consumption all specify precise, deterministic relationships. Dependent and independent variables are identified, a functional form is specified, and in most cases, at least a qualitative statement is made about the directions of effects that occur when independent variables in the model change. Of course, the model is only a simplification of reality. It will include the salient features of the relationship of interest, but will leave unaccounted for influences that might well be present but are regarded as unimportant. No model could hope to encompass the myriad essentially random aspects of economic life. It is thus also necessary to incorporate stochastic elements. As a consequence, observations on a dependent variable will display variation attributable not only to differences in variables that are explicitly accounted for, but also to the randomness of human behavior and the interaction of countless minor influences that are not. It is understood that the introduction of a random “disturbance” into a deterministic model is not intended merely to paper over its inadequacies. It is

¹ Modern economists are rarely this confident about their theories. More contemporary applications generally begin from first principles and behavioral axioms, rather than simple observation.
  • 28. [FIGURE 1.1 Consumption Data, 1970–1979. Scatter plot with X on the horizontal axis and C on the vertical axis.]

essential to examine the results of the study, in a sort of postmortem, to ensure that the allegedly random, unexplained factor is truly unexplainable. If it is not, the model is, in fact, inadequate. The stochastic element endows the model with its statistical properties. Observations on the variable(s) under study are thus taken to be the outcomes of a random process. With a sufficiently detailed stochastic structure and adequate data, the analysis will become a matter of deducing the properties of a probability distribution. The tools and methods of mathematical statistics will provide the operating principles.

A model (or theory) can never truly be confirmed unless it is made so broad as to include every possibility. But it may be subjected to ever more rigorous scrutiny and, in the face of contradictory evidence, refuted. A deterministic theory will be invalidated by a single contradictory observation. The introduction of stochastic elements into the model changes it from an exact statement to a probabilistic description about expected outcomes and carries with it an important implication. Only a preponderance of contradictory evidence can convincingly invalidate the probabilistic model, and what constitutes a “preponderance of evidence” is a matter of interpretation. Thus, the probabilistic model is less precise but at the same time, more robust.²

The process of econometric analysis departs from the specification of a theoretical relationship. We initially proceed on the optimistic assumption that we can obtain precise measurements on all the variables in a correctly specified model. If the ideal conditions are met at every step, the subsequent analysis will probably be routine. Unfortunately, they rarely are. Some of the difficulties one can expect to encounter are the following:

² See Keuzenkamp and Magnus (1995) for a lengthy symposium on testing in econometrics.
  • 29.
    • The data may be badly measured or may correspond only vaguely to the variables in the model. “The interest rate” is one example.
    • Some of the variables may be inherently unmeasurable. “Expectations” are a case in point.
    • The theory may make only a rough guess as to the correct functional form, if it makes any at all, and we may be forced to choose from an embarrassingly long menu of possibilities.
    • The assumed stochastic properties of the random terms in the model may be demonstrably violated, which may call into question the methods of estimation and inference procedures we have used.
    • Some relevant variables may be missing from the model.

The ensuing steps of the analysis consist of coping with these problems and attempting to cull whatever information is likely to be present in such obviously imperfect data. The methodology is that of mathematical statistics and economic theory. The product is an econometric model.

1.3 DATA AND METHODOLOGY

The connection between underlying behavioral models and the modern practice of econometrics is increasingly strong. Practitioners rely heavily on the theoretical tools of microeconomics including utility maximization, profit maximization, and market equilibrium. Macroeconomic model builders rely on the interactions between economic agents and policy makers. The analyses are directed at subtle, difficult questions that often require intricate, complicated formulations. A few applications:

    • What are the likely effects on labor supply behavior of proposed negative income taxes? [Ashenfelter and Heckman (1974).]
    • Does a monetary policy regime that is strongly oriented toward controlling inflation impose a real cost in terms of lost output on the U.S. economy? [Cecchetti and Rich (2001).]
    • Did 2001’s largest federal tax cut in U.S. history contribute to or dampen the concurrent recession? Or was it irrelevant? (Still to be analyzed.)
    • Does attending an elite college bring an expected payoff in lifetime expected income sufficient to justify the higher tuition? [Krueger and Dale (2001) and Krueger (2002).]
    • Does a voluntary training program produce tangible benefits? Can these benefits be accurately measured? [Angrist (2001).]

Each of these analyses would depart from a formal model of the process underlying the observed data.

The field of econometrics is large and rapidly growing. In one dimension, we can distinguish between theoretical and applied econometrics. Theorists develop new techniques and analyze the consequences of applying particular methods when the assumptions that justify them are not met. Applied econometricians are the users of these techniques and the analysts of data (real world and simulated). Of course, the distinction is far from clean; practitioners routinely develop new analytical tools for the purposes of
  • 30. the study that they are involved in. This book contains a heavy dose of theory, but it is directed toward applied econometrics. I have attempted to survey techniques, admittedly some quite elaborate and intricate, that have seen wide use “in the field.”

Another loose distinction can be made between microeconometrics and macroeconometrics. The former is characterized largely by its analysis of cross section and panel data and by its focus on individual consumers, firms, and micro-level decision makers. Macroeconometrics is generally involved in the analysis of time series data, usually of broad aggregates such as price levels, the money supply, exchange rates, output, and so on. Once again, the boundaries are not sharp. The very large field of financial econometrics is concerned with long time-series data and occasionally vast panel data sets, but with a very focused orientation toward models of individual behavior. The analysis of market returns and exchange rate behavior is neither macro- nor microeconometric in nature, or perhaps it is some of both. Another application that we will examine in this text concerns spending patterns of municipalities, which, again, rests somewhere between the two fields.

Applied econometric methods will be used for estimation of important quantities, analysis of economic outcomes, markets or individual behavior, testing theories, and for forecasting. The last of these is an art and science in itself, and (fortunately) the subject of a vast library of sources. Though we will briefly discuss some aspects of forecasting, our interest in this text will be on estimation and analysis of models. The presentation, where there is a distinction to be made, will contain a blend of microeconometric and macroeconometric techniques and applications. The first 18 chapters of the book are largely devoted to results that form the platform of both areas. Chapters 19 and 20 focus on time series modeling while Chapters 21 and 22 are devoted to methods more suited to cross sections and panels, and used more frequently in microeconometrics. Save for some brief applications, we will not be spending much time on financial econometrics. For those with an interest in this field, I would recommend the celebrated work by Campbell, Lo, and MacKinlay (1997). It is also necessary to distinguish between time series analysis (which is not our focus) and methods that primarily use time series data. The former is, like forecasting, a growth industry served by its own literature in many fields. While we will employ some of the techniques of time series analysis, we will spend relatively little time developing first principles.

The techniques used in econometrics have been employed in a widening variety of fields, including political methodology, sociology [see, e.g., Long (1997)], health economics, medical research (how do we handle attrition from medical treatment studies?), environmental economics, transportation engineering, and numerous others. Practitioners in these fields and many more are all heavy users of the techniques described in this text.

1.4 PLAN OF THE BOOK

The remainder of this book is organized into five parts:

    1. Chapters 2 through 9 present the classical linear and nonlinear regression models. We will discuss specification, estimation, and statistical inference.
    2. Chapters 10 through 15 describe the generalized regression model, panel data
  • 31. applications, and systems of equations.
    3. Chapters 16 through 18 present general results on different methods of estimation including maximum likelihood, GMM, and simulation methods. Various estimation frameworks, including non- and semiparametric and Bayesian estimation are presented in Chapters 16 and 18.
    4. Chapters 19 through 22 present topics in applied econometrics. Chapters 19 and 20 are devoted to topics in time series modeling while Chapters 21 and 22 are about microeconometrics, discrete choice modeling, and limited dependent variables.
    5. Appendices A through D present background material on tools used in econometrics including matrix algebra, probability and distribution theory, estimation, and asymptotic distribution theory. Appendix E presents results on computation.

Appendices A through D are chapter-length surveys of the tools used in econometrics. Since it is assumed that the reader has some previous training in each of these topics, these summaries are included primarily for those who desire a refresher or a convenient reference. We do not anticipate that these appendices can substitute for a course in any of these subjects. The intent of these chapters is to provide a reasonably concise summary of the results, nearly all of which are explicitly used elsewhere in the book. The data sets used in the numerical examples are described in Appendix F. The actual data sets and other supplementary materials can be downloaded from the website for the text,
  • 32. 2 THE CLASSICAL MULTIPLE LINEAR REGRESSION MODEL

2.1 INTRODUCTION

An econometric study begins with a set of propositions about some aspect of the economy. The theory specifies a set of precise, deterministic relationships among variables. Familiar examples are demand equations, production functions, and macroeconomic models. The empirical investigation provides estimates of unknown parameters in the model, such as elasticities or the effects of monetary policy, and usually attempts to measure the validity of the theory against the behavior of observable data. Once suitably constructed, the model might then be used for prediction or analysis of behavior. This book will develop a large number of models and techniques used in this framework.

The linear regression model is the single most useful tool in the econometrician’s kit. Though to an increasing degree in the contemporary literature, it is often only the departure point for the full analysis, it remains the device used to begin almost all empirical research. This chapter will develop the model. The next several chapters will discuss more elaborate specifications and complications that arise in the application of techniques that are based on the simple models presented here.

2.2 THE LINEAR REGRESSION MODEL

The multiple linear regression model is used to study the relationship between a dependent variable and one or more independent variables. The generic form of the linear regression model is

\[
\begin{aligned}
y &= f(x_1, x_2, \ldots, x_K) + \varepsilon \\
  &= x_1\beta_1 + x_2\beta_2 + \cdots + x_K\beta_K + \varepsilon
\end{aligned}
\tag{2-1}
\]

where $y$ is the dependent or explained variable and $x_1, \ldots, x_K$ are the independent or explanatory variables. One’s theory will specify $f(x_1, x_2, \ldots, x_K)$. This function is commonly called the population regression equation of $y$ on $x_1, \ldots, x_K$. In this setting, $y$ is the regressand and $x_k$, $k = 1, \ldots, K$, are the regressors or covariates. The underlying theory will specify the dependent and independent variables in the model. It is not always obvious which is appropriately defined as each of these; for example, a demand equation, $\text{quantity} = \beta_1 + \text{price} \times \beta_2 + \text{income} \times \beta_3 + \varepsilon$, and an inverse demand equation, $\text{price} = \gamma_1 + \text{quantity} \times \gamma_2 + \text{income} \times \gamma_3 + u$, are equally valid representations of a market. For modeling purposes, it will often prove useful to think in terms of “autonomous variation.” One can conceive of movement of the independent
  • 33. variables outside the relationships defined by the model while movement of the dependent variable is considered in response to some independent or exogenous stimulus.¹

The term $\varepsilon$ is a random disturbance, so named because it “disturbs” an otherwise stable relationship. The disturbance arises for several reasons, primarily because we cannot hope to capture every influence on an economic variable in a model, no matter how elaborate. The net effect, which can be positive or negative, of these omitted factors is captured in the disturbance. There are many other contributors to the disturbance in an empirical model. Probably the most significant is errors of measurement. It is easy to theorize about the relationships among precisely defined variables; it is quite another to obtain accurate measures of these variables. For example, the difficulty of obtaining reasonable measures of profits, interest rates, capital stocks, or, worse yet, flows of services from capital stocks is a recurrent theme in the empirical literature. At the extreme, there may be no observable counterpart to the theoretical variable. The literature on the permanent income model of consumption [e.g., Friedman (1957)] provides an interesting example.

We assume that each observation in a sample $(y_i, x_{i1}, x_{i2}, \ldots, x_{iK})$, $i = 1, \ldots, n$, is generated by an underlying process described by
$$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i.$$
The observed value of $y_i$ is the sum of two parts, a deterministic part and the random part, $\varepsilon_i$. Our objective is to estimate the unknown parameters of the model, use the data to study the validity of the theoretical propositions, and perhaps use the model to predict the variable $y$. How we proceed from here depends crucially on what we assume about the stochastic process that has led to our observations of the data in hand.
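The generating process just described, a deterministic part plus a random disturbance, can be mimicked numerically. The following is only an illustrative sketch: the function name `simulate_linear_model`, the parameter values, the sample size, and the choice of normally distributed regressors and disturbances are all assumptions made for the demonstration and do not come from the text.

```python
import numpy as np

def simulate_linear_model(n, beta, sigma, seed=0):
    """Draw (y, X, eps) from y_i = x_i1*b1 + ... + x_iK*bK + eps_i (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    K = len(beta)
    # First column is a constant; the remaining K-1 columns are random regressors.
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    # Disturbances: assumed normal here purely for convenience.
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    return y, X, eps

beta = np.array([1.0, 0.5, -2.0])   # hypothetical "true" parameters
y, X, eps = simulate_linear_model(n=200, beta=beta, sigma=1.0)

# The observed y is the sum of a deterministic part and the random part.
deterministic_part = X @ beta
```

In a real data set only `y` and `X` would be available; `beta`, `eps`, and `deterministic_part` are visible here only because the data are simulated.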
Example 2.1 Keynes’s Consumption Function
Example 1.1 discussed a model of consumption proposed by Keynes in his General Theory (1936). The theory that consumption, $C$, and income, $X$, are related certainly seems consistent with the observed “facts” in Figures 1.1 and 2.1. (These data are in Data Table F2.1.) Of course, the linear function is only approximate. Even ignoring the anomalous wartime years, consumption and income cannot be connected by any simple deterministic relationship. The linear model, $C = \alpha + \beta X$, is intended only to represent the salient features of this part of the economy. It is hopeless to attempt to capture every influence in the relationship. The next step is to incorporate the inherent randomness in its real world counterpart. Thus, we write $C = f(X, \varepsilon)$, where $\varepsilon$ is a stochastic element. It is important not to view $\varepsilon$ as a catchall for the inadequacies of the model. The model including $\varepsilon$ appears adequate for the data not including the war years, but for 1942–1945, something systematic clearly seems to be missing. Consumption in these years could not rise to rates historically consistent with these levels of income because of wartime rationing. A model meant to describe consumption in this period would have to accommodate this influence.

It remains to establish how the stochastic element will be incorporated in the equation. The most frequent approach is to assume that it is additive. Thus, we recast the equation in stochastic terms: $C = \alpha + \beta X + \varepsilon$. This equation is an empirical counterpart to Keynes’s theoretical model. But, what of those anomalous years of rationing? If we were to ignore our intuition and attempt to “fit” a line to all these data (the next chapter will discuss at length how we should do that), we might arrive at the dotted line in the figure as our best guess. This line, however, is obviously being distorted by the rationing.

¹ By this definition, it would seem that in our demand relationship, only income would be an independent variable while both price and quantity would be dependent. That makes sense: in a market, price and quantity are determined at the same time, and change only when something outside the market changes. We will return to this specific case in Chapter 15.

  • 34. [Figure 2.1: Consumption Data, 1940–1950. Scatter plot of consumption C against income X, with each point labeled by year.]

A more appropriate specification for these data that accommodates both the stochastic nature of the data and the special circumstances of the years 1942–1945 might be one that shifts straight down in the war years, $C = \alpha + \beta X + d_{waryears}\delta_w + \varepsilon$, where the new variable, $d_{waryears}$, equals one in 1942–1945 and zero in other years, and $\delta_w < 0$.

One of the most useful aspects of the multiple regression model is its ability to identify the independent effects of a set of variables on a dependent variable. Example 2.2 describes a common application.

Example 2.2 Earnings and Education
A number of recent studies have analyzed the relationship between earnings and education. We would expect, on average, higher levels of education to be associated with higher incomes. The simple regression model
$$\text{earnings} = \beta_1 + \beta_2\,\text{education} + \varepsilon,$$
however, neglects the fact that most people have higher incomes when they are older than when they are young, regardless of their education. Thus, $\beta_2$ will overstate the marginal impact of education. If age and education are positively correlated, then the regression model will associate all the observed increases in income with increases in education. A better specification would account for the effect of age, as in
$$\text{earnings} = \beta_1 + \beta_2\,\text{education} + \beta_3\,\text{age} + \varepsilon.$$
It is often observed that income tends to rise less rapidly in the later earning years than in the early ones. To accommodate this possibility, we might extend the model to
$$\text{earnings} = \beta_1 + \beta_2\,\text{education} + \beta_3\,\text{age} + \beta_4\,\text{age}^2 + \varepsilon.$$
We would expect $\beta_3$ to be positive and $\beta_4$ to be negative. The crucial feature of this model is that it allows us to carry out a conceptual experiment that might not be observed in the actual data. In the example, we might like to (and could) compare the earnings of two individuals of the same age with different amounts of “education” even if the data set does not actually contain two such individuals. How education should be
  • 35. measured in this setting is a difficult problem. The study of the earnings of twins by Ashenfelter and Krueger (1994), which uses precisely this specification of the earnings equation, presents an interesting approach. We will examine this study in some detail in Section 5.6.4.

A large literature has been devoted to an intriguing question on this subject. Education is not truly “independent” in this setting. Highly motivated individuals will choose to pursue more education (for example, by going to college or graduate school) than others. By the same token, highly motivated individuals may do things that, on average, lead them to have higher incomes. If so, does a positive $\beta_2$ that suggests an association between income and education really measure the effect of education on income, or does it reflect the effect of some underlying factor on both variables that we have not included in our regression model? We will revisit the issue in Section 22.4.

2.3 ASSUMPTIONS OF THE CLASSICAL LINEAR REGRESSION MODEL

The classical linear regression model consists of a set of assumptions about how a data set will be produced by an underlying “data-generating process.” The theory will specify a deterministic relationship between the dependent variable and the independent variables. The assumptions that describe the form of the model and relationships among its parts and imply appropriate estimation and inference procedures are listed in Table 2.1.

2.3.1 LINEARITY OF THE REGRESSION MODEL

Let the column vector $\mathbf{x}_k$ be the $n$ observations on variable $x_k$, $k = 1, \ldots, K$, and assemble these data in an $n \times K$ data matrix $\mathbf{X}$. In most contexts, the first column of $\mathbf{X}$ is assumed to be a column of 1s so that $\beta_1$ is the constant term in the model. Let $\mathbf{y}$ be the $n$ observations, $y_1, \ldots, y_n$, and let $\boldsymbol{\varepsilon}$ be the column vector containing the $n$ disturbances.

TABLE 2.1 Assumptions of the Classical Linear Regression Model
A1. Linearity: $y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i$. The model specifies a linear relationship between $y$ and $x_1, \ldots, x_K$.
A2. Full rank: There is no exact linear relationship among any of the independent variables in the model. This assumption will be necessary for estimation of the parameters of the model.
A3. Exogeneity of the independent variables: $E[\varepsilon_i \mid x_{j1}, x_{j2}, \ldots, x_{jK}] = 0$. This states that the expected value of the disturbance at observation $i$ in the sample is not a function of the independent variables observed at any observation, including this one. This means that the independent variables will not carry useful information for prediction of $\varepsilon_i$.
A4. Homoscedasticity and nonautocorrelation: Each disturbance, $\varepsilon_i$, has the same finite variance, $\sigma^2$, and is uncorrelated with every other disturbance, $\varepsilon_j$. This assumption limits the generality of the model, and we will want to examine how to relax it in the chapters to follow.
A5. Exogenously generated data: The data in $(x_{j1}, x_{j2}, \ldots, x_{jK})$ may be any mixture of constants and random variables. The process generating the data operates outside the assumptions of the model, that is, independently of the process that generates $\varepsilon_i$. Note that this extends A3. Analysis is done conditionally on the observed $\mathbf{X}$.
A6. Normal distribution: The disturbances are normally distributed. Once again, this is a convenience that we will dispense with after some analysis of its implications.
  • 36. The model in (2-1) as it applies to all $n$ observations can now be written
$$\mathbf{y} = \mathbf{x}_1\beta_1 + \cdots + \mathbf{x}_K\beta_K + \boldsymbol{\varepsilon}, \tag{2-2}$$
or in the form of Assumption 1,

ASSUMPTION:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{2-3}$$

A NOTATIONAL CONVENTION. Henceforth, to avoid a possibly confusing and cumbersome notation, we will use a boldface $\mathbf{x}$ to denote a column or a row of $\mathbf{X}$. Which applies will be clear from the context. In (2-2), $\mathbf{x}_k$ is the $k$th column of $\mathbf{X}$. Subscripts $j$ and $k$ will be used to denote columns (variables). It will often be convenient to refer to a single observation in (2-3), which we would write
$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i. \tag{2-4}$$
Subscripts $i$ and $t$ will generally be used to denote rows (observations) of $\mathbf{X}$. In (2-4), $\mathbf{x}_i$ is a column vector that is the transpose of the $i$th $1 \times K$ row of $\mathbf{X}$.

Our primary interest is in estimation and inference about the parameter vector $\boldsymbol{\beta}$. Note that the simple regression model in Example 2.1 is a special case in which $\mathbf{X}$ has only two columns, the first of which is a column of 1s. The assumption of linearity of the regression model includes the additive disturbance. For the regression to be linear in the sense described here, it must be of the form in (2-1) either in the original variables or after some suitable transformation. For example, the model
$$y = Ax^{\beta}e^{\varepsilon}$$
is linear (after taking logs on both sides of the equation), whereas
$$y = Ax^{\beta} + \varepsilon$$
is not. The observed dependent variable is thus the sum of two components, a deterministic element $\alpha + \beta x$ and a random variable $\varepsilon$. It is worth emphasizing that neither of the two parts is directly observed because $\alpha$ and $\beta$ are unknown.

The linearity assumption is not so narrow as it might first appear. In the regression context, linearity refers to the manner in which the parameters and the disturbance enter the equation, not necessarily to the relationship among the variables. For example, the equations $y = \alpha + \beta x + \varepsilon$, $y = \alpha + \beta\cos(x) + \varepsilon$, $y = \alpha + \beta/x + \varepsilon$, and $y = \alpha + \beta\ln x + \varepsilon$ are all linear in some function of $x$ by the definition we have used here. In the examples, only $x$ has been transformed, but $y$ could have been as well, as in $y = Ax^{\beta}e^{\varepsilon}$, which is a linear relationship in the logs of $x$ and $y$: $\ln y = \alpha + \beta\ln x + \varepsilon$. The variety of functions is unlimited. This aspect of the model is used in a number of commonly used functional forms. For example, the loglinear model is
$$\ln y = \beta_1 + \beta_2\ln X_2 + \beta_3\ln X_3 + \cdots + \beta_K\ln X_K + \varepsilon.$$
This equation is also known as the constant elasticity form, since in this equation the elasticity of $y$ with respect to changes in $x_k$ is $\partial\ln y/\partial\ln x_k = \beta_k$, which does not vary
  • 37. with $x_k$. The log linear form is often used in models of demand and production. Different values of $\beta$ produce widely varying functions.

Example 2.3 The U.S. Gasoline Market
Data on the U.S. gasoline market for the years 1960–1995 are given in Table F2.2 in Appendix F. We will use these data to obtain, among other things, estimates of the income, own price, and cross-price elasticities of demand in this market. These data also present an interesting question on the issue of holding “all other things constant,” that was suggested in Example 2.2. In particular, consider a somewhat abbreviated model of per capita gasoline consumption:
$$\ln(G/pop) = \beta_1 + \beta_2\ln income + \beta_3\ln price_G + \beta_4\ln P_{newcars} + \beta_5\ln P_{usedcars} + \varepsilon.$$
This model will provide estimates of the income and price elasticities of demand for gasoline and an estimate of the elasticity of demand with respect to the prices of new and used cars. What should we expect for the sign of $\beta_4$? Cars and gasoline are complementary goods, so if the prices of new cars rise, ceteris paribus, gasoline consumption should fall. Or should it? If the prices of new cars rise, then consumers will buy fewer of them; they will keep their used cars longer and buy fewer new cars. If older cars use more gasoline than newer ones, then the rise in the prices of new cars would lead to higher gasoline consumption than otherwise, not lower. We can use the multiple regression model and the gasoline data to attempt to answer the question.

A semilog model is often used to model growth rates:
$$\ln y_t = \mathbf{x}_t'\boldsymbol{\beta} + \delta t + \varepsilon_t.$$
In this model, the autonomous (at least not explained by the model itself) proportional, per period growth rate is $d\ln y/dt = \delta$. Other variations of the general form
$$f(y_t) = g(\mathbf{x}_t'\boldsymbol{\beta} + \varepsilon_t)$$
will allow a tremendous variety of functional forms, all of which fit into our definition of a linear model.
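The constant elasticity idea can be checked on artificial data. The sketch below is illustrative only: the values of `A` and `beta`, the uniform regressor, and the normal disturbance are assumptions, and `np.linalg.lstsq` is used simply as an off-the-shelf least squares routine (least squares itself is developed in Chapter 3). It generates data from the multiplicative model $y = Ax^{\beta}e^{\varepsilon}$, takes logs, and recovers $\beta$ as the slope on $\ln x$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
A, beta = 2.0, 0.7                    # hypothetical scale and elasticity
x = rng.uniform(1.0, 10.0, size=n)
eps = rng.normal(scale=0.1, size=n)   # assumed disturbance distribution
y = A * x**beta * np.exp(eps)         # multiplicative model y = A x^beta e^eps

# Taking logs: ln y = ln A + beta * ln x + eps, which is linear in (ln A, beta).
Z = np.column_stack([np.ones(n), np.log(x)])
coef = np.linalg.lstsq(Z, np.log(y), rcond=None)[0]
ln_A_hat, beta_hat = coef

print(f"estimated elasticity: {beta_hat:.3f}, estimated ln A: {ln_A_hat:.3f}")
```

The slope on $\ln x$ estimates the elasticity directly, and it is the same at every value of $x$, which is what "constant elasticity" means here. The same regression on the untransformed $y$ and $x$ would not have that interpretation.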
The linear regression model is sometimes interpreted as an approximation to some unknown, underlying function. (See Section A.8.1 for discussion.) By this interpretation, however, the linear model, even with quadratic terms, is fairly limited in that such an approximation is likely to be useful only over a small range of variation of the independent variables. The translog model discussed in Example 2.4, in contrast, has proved far more effective as an approximating function.

Example 2.4 The Translog Model
Modern studies of demand and production are usually done in the context of a flexible functional form. Flexible functional forms are used in econometrics because they allow analysts to model second-order effects such as elasticities of substitution, which are functions of the second derivatives of production, cost, or utility functions. The linear model restricts these to equal zero, whereas the log linear model (e.g., the Cobb–Douglas model) restricts the interesting elasticities to the uninteresting values of –1 or +1. The most popular flexible functional form is the translog model, which is often interpreted as a second-order approximation to an unknown functional form. [See Berndt and Christensen (1973).] One way to derive it is as follows. We first write $y = g(x_1, \ldots, x_K)$. Then, $\ln y = \ln g(\ldots) = f(\ldots)$. Since by a trivial transformation $x_k = \exp(\ln x_k)$, we interpret the function as a function of the logarithms of the $x$’s. Thus, $\ln y = f(\ln x_1, \ldots, \ln x_K)$.
  • 38. Now, expand this function in a second-order Taylor series around the point $\mathbf{x} = [1, 1, \ldots, 1]'$ so that at the expansion point, the log of each variable is a convenient zero. Then
$$\ln y = f(\mathbf{0}) + \sum_{k=1}^{K}\left[\partial f(\cdot)/\partial\ln x_k\right]\Big|_{\ln\mathbf{x}=\mathbf{0}}\ln x_k + \frac{1}{2}\sum_{k=1}^{K}\sum_{l=1}^{K}\left[\partial^2 f(\cdot)/\partial\ln x_k\,\partial\ln x_l\right]\Big|_{\ln\mathbf{x}=\mathbf{0}}\ln x_k\ln x_l + \varepsilon.$$
The disturbance in this model is assumed to embody the familiar factors and the error of approximation to the unknown function. Since the function and its derivatives evaluated at the fixed value $\mathbf{0}$ are constants, we interpret them as the coefficients and write
$$\ln y = \beta_0 + \sum_{k=1}^{K}\beta_k\ln x_k + \frac{1}{2}\sum_{k=1}^{K}\sum_{l=1}^{K}\gamma_{kl}\ln x_k\ln x_l + \varepsilon.$$
This model is linear by our definition but can, in fact, mimic an impressive amount of curvature when it is used to approximate another function. An interesting feature of this formulation is that the log linear model is a special case, $\gamma_{kl} = 0$. Also, there is an interesting test of the underlying theory possible because if the underlying function were assumed to be continuous and twice continuously differentiable, then by Young’s theorem it must be true that $\gamma_{kl} = \gamma_{lk}$. We will see in Chapter 14 how this feature is studied in practice.

Despite its great flexibility, the linear model does not include all the situations we encounter in practice. For a simple example, there is no transformation that will reduce $y = \alpha + 1/(\beta_1 + \beta_2 x) + \varepsilon$ to linearity. The methods we consider in this chapter are not appropriate for estimating the parameters of such a model. Relatively straightforward techniques have been developed for nonlinear models such as this, however. We shall treat them in detail in Chapter 9.

2.3.2 FULL RANK

Assumption 2 is that there are no exact linear relationships among the variables.

ASSUMPTION: $\mathbf{X}$ is an $n \times K$ matrix with rank $K$. (2-5)

Hence, $\mathbf{X}$ has full column rank; the columns of $\mathbf{X}$ are linearly independent and there are at least $K$ observations. [See (A-42) and the surrounding text.] This assumption is known as an identification condition. To see the need for this assumption, consider an example.

Example 2.5 Short Rank
Suppose that a cross-section model specifies
$$C = \beta_1 + \beta_2\,\text{nonlabor income} + \beta_3\,\text{salary} + \beta_4\,\text{total income} + \varepsilon,$$
where total income is exactly equal to salary plus nonlabor income. Clearly, there is an exact linear dependency in the model. Now let $\beta_2' = \beta_2 + a$, $\beta_3' = \beta_3 + a$, and $\beta_4' = \beta_4 - a$,
  • 39. where $a$ is any number. Then the exact same value appears on the right-hand side of $C$ if we substitute $\beta_2'$, $\beta_3'$, and $\beta_4'$ for $\beta_2$, $\beta_3$, and $\beta_4$. Obviously, there is no way to estimate the parameters of this model.

If there are fewer than $K$ observations, then $\mathbf{X}$ cannot have full rank. Hence, we make the (redundant) assumption that $n$ is at least as large as $K$.

In a two-variable linear model with a constant term, the full rank assumption means that there must be variation in the regressor $x$. If there is no variation in $x$, then all our observations will lie on a vertical line. This situation does not invalidate the other assumptions of the model; presumably, it is a flaw in the data set. The possibility that this suggests is that we could have drawn a sample in which there was variation in $x$, but in this instance, we did not. Thus, the model still applies, but we cannot learn about it from the data set in hand.

2.3.3 REGRESSION

The disturbance is assumed to have conditional expected value zero at every observation, which we write as
$$E[\varepsilon_i \mid \mathbf{X}] = 0. \tag{2-6}$$
For the full set of observations, we write Assumption 3 as:

ASSUMPTION:
$$E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \begin{bmatrix} E[\varepsilon_1 \mid \mathbf{X}] \\ E[\varepsilon_2 \mid \mathbf{X}] \\ \vdots \\ E[\varepsilon_n \mid \mathbf{X}] \end{bmatrix} = \mathbf{0}. \tag{2-7}$$

There is a subtle point in this discussion that the observant reader might have noted. In (2-7), the left-hand side states, in principle, that the mean of each $\varepsilon_i$ conditioned on all observations $\mathbf{x}_i$ is zero. This conditional mean assumption states, in words, that no observations on $\mathbf{x}$ convey information about the expected value of the disturbance. It is conceivable (for example, in a time-series setting) that although $\mathbf{x}_i$ might provide no information about $E[\varepsilon_i \mid \cdot]$, $\mathbf{x}_j$ at some other observation, such as in the next time period, might. Our assumption at this point is that there is no information about $E[\varepsilon_i \mid \cdot]$ contained in any observation $\mathbf{x}_j$. Later, when we extend the model, we will study the implications of dropping this assumption. [See Wooldridge (1995).] We will also assume that the disturbances convey no information about each other. That is, $E[\varepsilon_i \mid \varepsilon_1, \ldots, \varepsilon_{i-1}, \varepsilon_{i+1}, \ldots, \varepsilon_n] = 0$. In sum, at this point, we have assumed that the disturbances are purely random draws from some population.

The zero conditional mean implies that the unconditional mean is also zero, since
$$E[\varepsilon_i] = E_{\mathbf{X}}\big[E[\varepsilon_i \mid \mathbf{X}]\big] = E_{\mathbf{X}}[0] = 0.$$
Since, for each $\varepsilon_i$, $\mathrm{Cov}\big[E[\varepsilon_i \mid \mathbf{X}], \mathbf{X}\big] = \mathrm{Cov}[\varepsilon_i, \mathbf{X}]$, Assumption 3 implies that $\mathrm{Cov}[\varepsilon_i, \mathbf{X}] = \mathbf{0}$ for all $i$. (Exercise: Is the converse true?)

In most cases, the zero mean assumption is not restrictive. Consider a two-variable model and suppose that the mean of $\varepsilon$ is $\mu \neq 0$. Then $\alpha + \beta x + \varepsilon$ is the same as $(\alpha + \mu) + \beta x + (\varepsilon - \mu)$. Letting $\alpha' = \alpha + \mu$ and $\varepsilon' = \varepsilon - \mu$ produces the original model, now with a zero-mean disturbance. For an application, see the discussion of frontier production functions in Section 17.6.3.
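The argument that a nonzero disturbance mean is absorbed by the constant term can be illustrated with simulated data. This is a minimal sketch under stated assumptions: the values of `alpha`, `beta`, and `mu` are hypothetical, the regressor and disturbance are drawn from normal distributions for convenience, and `np.linalg.lstsq` stands in for the least squares method developed in Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
alpha, beta, mu = 1.0, 0.5, 3.0      # hypothetical values; mu is the nonzero mean of eps
x = rng.normal(size=n)
eps = mu + rng.normal(size=n)        # disturbance with mean mu rather than zero
y = alpha + beta * x + eps

# Fit a line with a constant term.
X = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The fitted constant tracks alpha + mu; the slope is unaffected by mu.
print(f"constant: {a_hat:.3f} (alpha + mu = {alpha + mu}), slope: {b_hat:.3f}")
```

With a constant in the model, the data cannot separate $\alpha$ from $\mu$; only their sum appears in the fitted intercept, which is why assuming a zero mean costs nothing in this case.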
  • 40. But, if the original model does not contain a constant term, then assuming $E[\varepsilon_i] = 0$ could be substantive. If $E[\varepsilon_i]$ can be expressed as a linear function of $\mathbf{x}_i$, then, as before, a transformation of the model will produce disturbances with zero means. But, if not, then the nonzero mean of the disturbances will be a substantive part of the model structure. This does suggest that there is a potential problem in models without constant terms. As a general rule, regression models should not be specified without constant terms unless this is specifically dictated by the underlying theory.² Arguably, if we have reason to specify that the mean of the disturbance is something other than zero, we should build it into the systematic part of the regression, leaving in the disturbance only the unknown part of $\varepsilon$. Assumption 3 also implies that
$$E[\mathbf{y} \mid \mathbf{X}] = \mathbf{X}\boldsymbol{\beta}. \tag{2-8}$$
Assumptions 1 and 3 comprise the linear regression model. The regression of $\mathbf{y}$ on $\mathbf{X}$ is the conditional mean, $E[\mathbf{y} \mid \mathbf{X}]$, so that without Assumption 3, $\mathbf{X}\boldsymbol{\beta}$ is not the conditional mean function.

The remaining assumptions will more completely specify the characteristics of the disturbances in the model and state the conditions under which the sample observations on $\mathbf{x}$ are obtained.

2.3.4 SPHERICAL DISTURBANCES

The fourth assumption concerns the variances and covariances of the disturbances:
$$\mathrm{Var}[\varepsilon_i \mid \mathbf{X}] = \sigma^2, \quad \text{for all } i = 1, \ldots, n,$$
and
$$\mathrm{Cov}[\varepsilon_i, \varepsilon_j \mid \mathbf{X}] = 0, \quad \text{for all } i \neq j.$$
Constant variance is labeled homoscedasticity. Consider a model that describes the profits of firms in an industry as a function of, say, size. Even accounting for size, measured in dollar terms, the profits of large firms will exhibit greater variation than those of smaller firms. The homoscedasticity assumption would be inappropriate here. Also, survey data on household expenditure patterns often display marked heteroscedasticity, even after accounting for income and household size.

Uncorrelatedness across observations is labeled generically nonautocorrelation. In Figure 2.1, there is some suggestion that the disturbances might not be truly independent across observations. Although the number of observations is limited, it does appear that, on average, each disturbance tends to be followed by one with the same sign. This “inertia” is precisely what is meant by autocorrelation, and it is assumed away at this point. Methods of handling autocorrelation in economic data occupy a large proportion of the literature and will be treated at length in Chapter 12. Note that nonautocorrelation does not imply that observations $y_i$ and $y_j$ are uncorrelated. The assumption is that deviations of observations from their expected values are uncorrelated.

² Models that describe first differences of variables might well be specified without constants. Consider $y_t - y_{t-1}$. If there is a constant term $\alpha$ on the right-hand side of the equation, then $y_t$ is a function of $\alpha t$, which is an explosive regressor. Models with linear time trends merit special treatment in the time-series literature. We will return to this issue in Chapter 19.
  • 41. The two assumptions imply that
$$E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \begin{bmatrix} E[\varepsilon_1\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_1\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_1\varepsilon_n \mid \mathbf{X}] \\ E[\varepsilon_2\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_2\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_2\varepsilon_n \mid \mathbf{X}] \\ \vdots & \vdots & \ddots & \vdots \\ E[\varepsilon_n\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_n\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_n\varepsilon_n \mid \mathbf{X}] \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix},$$
which we summarize in Assumption 4:

ASSUMPTION:
$$E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \sigma^2\mathbf{I}. \tag{2-9}$$

By using the variance decomposition formula in (B-70), we find
$$\mathrm{Var}[\boldsymbol{\varepsilon}] = E\big[\mathrm{Var}[\boldsymbol{\varepsilon} \mid \mathbf{X}]\big] + \mathrm{Var}\big[E[\boldsymbol{\varepsilon} \mid \mathbf{X}]\big] = \sigma^2\mathbf{I}.$$
Once again, we should emphasize that this assumption describes the information about the variances and covariances among the disturbances that is provided by the independent variables. For the present, we assume that there is none. We will also drop this assumption later when we enrich the regression model. We are also assuming that the disturbances themselves provide no information about the variances and covariances. Although a minor issue at this point, it will become crucial in our treatment of time-series applications. Models such as $\mathrm{Var}[\varepsilon_t \mid \varepsilon_{t-1}] = \sigma^2 + \alpha\varepsilon_{t-1}^2$, a “GARCH” model (see Section 11.8), do not violate our conditional variance assumption, but do assume that $\mathrm{Var}[\varepsilon_t \mid \varepsilon_{t-1}] \neq \mathrm{Var}[\varepsilon_t]$.

Disturbances that meet the twin assumptions of homoscedasticity and nonautocorrelation are sometimes called spherical disturbances.³

³ The term will describe the multivariate normal distribution; see (B-95). If $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ in the multivariate normal density, then the equation $f(\mathbf{x}) = c$ is the formula for a “ball” centered at $\boldsymbol{\mu}$ with radius $\sigma$ in $n$-dimensional space. The name spherical is used whether or not the normal distribution is assumed; sometimes the “spherical normal” distribution is assumed explicitly.

2.3.5 DATA GENERATING PROCESS FOR THE REGRESSORS

It is common to assume that $\mathbf{x}_i$ is nonstochastic, as it would be in an experimental situation. Here the analyst chooses the values of the regressors and then observes $y_i$. This process might apply, for example, in an agricultural experiment in which $y_i$ is yield and $\mathbf{x}_i$ is fertilizer concentration and water applied. The assumption of nonstochastic regressors at this point would be a mathematical convenience. With it, we could use the results of elementary statistics to obtain our results by treating the vector $\mathbf{x}_i$ simply as a known constant in the probability distribution of $y_i$. With this simplification, Assumptions A3 and A4 would be made unconditional and the counterparts would now simply state that the probability distribution of $\varepsilon_i$ involves none of the constants in $\mathbf{X}$.

Social scientists are almost never able to analyze experimental data, and relatively few of their models are built around nonrandom regressors. Clearly, for example, in
  • 42. any model of the macroeconomy, it would be difficult to defend such an asymmetric treatment of aggregate data. Realistically, we have to allow the data on $\mathbf{x}_i$ to be random, the same as $y_i$, so an alternative formulation is to assume that $\mathbf{x}_i$ is a random vector and our formal assumption concerns the nature of the random process that produces $\mathbf{x}_i$. If $\mathbf{x}_i$ is taken to be a random vector, then Assumptions 1 through 4 become a statement about the joint distribution of $y_i$ and $\mathbf{x}_i$. The precise nature of the regressor and how we view the sampling process will be a major determinant of our derivation of the statistical properties of our estimators and test statistics. In the end, the crucial assumption is Assumption 3, the uncorrelatedness of $\mathbf{X}$ and $\boldsymbol{\varepsilon}$. Now, we do note that this alternative is not completely satisfactory either, since $\mathbf{X}$ may well contain nonstochastic elements, including a constant, a time trend, and dummy variables that mark specific episodes in time. This makes for an ambiguous conclusion, but there is a straightforward and economically useful way out of it. We will assume that $\mathbf{X}$ can be a mixture of constants and random variables, but the important assumption is that the ultimate source of the data in $\mathbf{X}$ is unrelated (statistically and economically) to the source of $\boldsymbol{\varepsilon}$.

ASSUMPTION: $\mathbf{X}$ may be fixed or random, but it is generated by a mechanism that is unrelated to $\boldsymbol{\varepsilon}$. (2-10)

2.3.6 NORMALITY

It is convenient to assume that the disturbances are normally distributed, with zero mean and constant variance. That is, we add normality of the distribution to Assumptions 3 and 4.

ASSUMPTION:
$$\boldsymbol{\varepsilon} \mid \mathbf{X} \sim N[\mathbf{0}, \sigma^2\mathbf{I}]. \tag{2-11}$$

In view of our description of the source of $\boldsymbol{\varepsilon}$, the conditions of the central limit theorem will generally apply, at least approximately, and the normality assumption will be reasonable in most settings. A useful implication of Assumption 6 is that observations on $\varepsilon_i$ are statistically independent as well as uncorrelated. [See the third point in Section B.8, (B-97) and (B-99).] Normality is often viewed as an unnecessary and possibly inappropriate addition to the regression model. Except in those cases in which some alternative distribution is explicitly assumed, as in the stochastic frontier model discussed in Section 17.6.3, the normality assumption is probably quite reasonable.

Normality is not necessary to obtain many of the results we use in multiple regression analysis, although it will enable us to obtain several exact statistical results. It does prove useful in constructing test statistics, as shown in Section 4.7. Later, it will be possible to relax this assumption and retain most of the statistical results we obtain here. (See Sections 5.3, 5.4 and 6.4.)

2.4 SUMMARY AND CONCLUSIONS

This chapter has framed the linear regression model, the basic platform for model building in econometrics. The assumptions of the classical regression model are summarized in Figure 2.2, which shows the two-variable case.
  • 43. [Figure 2.2: The Classical Regression Model. The plot shows the regression line E(y|x) = α + βx against x, with the conditional means E(y|x = x0), E(y|x = x1), and E(y|x = x2) marked and a normal density N(α + βx2, σ²) drawn at x2.]

Key Terms and Concepts: • Autocorrelation • Constant elasticity • Covariate • Dependent variable • Deterministic relationship • Disturbance • Exogeneity • Explained variable • Explanatory variable • Flexible functional form • Full rank • Heteroscedasticity • Homoscedasticity • Identification condition • Independent variable • Linear regression model • Loglinear model • Multiple linear regression model • Nonautocorrelation • Nonstochastic regressors • Normality • Normally distributed • Population regression equation • Regressand • Regression • Regressor • Second-order effects • Semilog • Spherical disturbances • Translog model
  • 44. Greene-50240 book June 3, 2002 9:52

3 LEAST SQUARES

3.1 INTRODUCTION

Chapter 2 defined the linear regression model as a set of characteristics of the population that underlies an observed sample of data. There are a number of different approaches to estimation of the parameters of the model. For a variety of practical and theoretical reasons that we will explore as we progress through the next several chapters, the method of least squares has long been the most popular. Moreover, in most cases in which some other estimation method is found to be preferable, least squares remains the benchmark approach, and often, the preferred method ultimately amounts to a modification of least squares. In this chapter, we begin the analysis of this important set of results by presenting a useful set of algebraic tools.

3.2 LEAST SQUARES REGRESSION

The unknown parameters of the stochastic relation $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$ are the objects of estimation. It is necessary to distinguish between population quantities, such as $\boldsymbol{\beta}$ and $\varepsilon_i$, and sample estimates of them, denoted $\mathbf{b}$ and $e_i$. The population regression is $E[y_i \mid \mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}$, whereas our estimate of $E[y_i \mid \mathbf{x}_i]$ is denoted

$$\hat{y}_i = \mathbf{x}_i'\mathbf{b}.$$

The disturbance associated with the $i$th data point is

$$\varepsilon_i = y_i - \mathbf{x}_i'\boldsymbol{\beta}.$$

For any value of $\mathbf{b}$, we shall estimate $\varepsilon_i$ with the residual

$$e_i = y_i - \mathbf{x}_i'\mathbf{b}.$$

From the definitions,

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i = \mathbf{x}_i'\mathbf{b} + e_i.$$

These equations are summarized for the two variable regression in Figure 3.1.

The population quantity $\boldsymbol{\beta}$ is a vector of unknown parameters of the probability distribution of $y_i$ whose values we hope to estimate with our sample data, $(y_i, \mathbf{x}_i)$, $i = 1, \ldots, n$. This is a problem of statistical inference. It is instructive, however, to begin by considering the purely algebraic problem of choosing a vector $\mathbf{b}$ so that the fitted line $\mathbf{x}_i'\mathbf{b}$ is close to the data points. The measure of closeness constitutes a fitting criterion.
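  • 45. CHAPTER 3 ✦ Least Squares

FIGURE 3.1 Population and Sample Regression. [Figure: axes $x$ and $y$; labeled elements are the population regression $E(y) = \alpha + \beta x$, the fitted line $\hat{y} = a + bx$, the disturbance $\varepsilon$, and the residual $e$.]

Although numerous candidates have been suggested, the one used most frequently is least squares.¹

3.2.1 THE LEAST SQUARES COEFFICIENT VECTOR

The least squares coefficient vector minimizes the sum of squared residuals:

$$\sum_{i=1}^{n} e_{i0}^2 = \sum_{i=1}^{n} (y_i - \mathbf{x}_i'\mathbf{b}_0)^2, \tag{3-1}$$

where $\mathbf{b}_0$ denotes the choice for the coefficient vector. In matrix terms, minimizing the sum of squares in (3-1) requires us to choose $\mathbf{b}_0$ to

$$\text{Minimize}_{\mathbf{b}_0}\; S(\mathbf{b}_0) = \mathbf{e}_0'\mathbf{e}_0 = (\mathbf{y} - \mathbf{X}\mathbf{b}_0)'(\mathbf{y} - \mathbf{X}\mathbf{b}_0). \tag{3-2}$$

Expanding this gives

$$\mathbf{e}_0'\mathbf{e}_0 = \mathbf{y}'\mathbf{y} - \mathbf{b}_0'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}\mathbf{b}_0 + \mathbf{b}_0'\mathbf{X}'\mathbf{X}\mathbf{b}_0 \tag{3-3}$$

or

$$S(\mathbf{b}_0) = \mathbf{y}'\mathbf{y} - 2\mathbf{y}'\mathbf{X}\mathbf{b}_0 + \mathbf{b}_0'\mathbf{X}'\mathbf{X}\mathbf{b}_0.$$

The necessary condition for a minimum is

$$\frac{\partial S(\mathbf{b}_0)}{\partial \mathbf{b}_0} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b}_0 = \mathbf{0}. \tag{3-4}$$

¹ We shall have to establish that the practical approach of fitting the line as closely as possible to the data by least squares leads to estimates with good statistical properties. This makes intuitive sense and is, indeed, the case. We shall return to the statistical issues in Chapters 4 and 5.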
  • 46. Let $\mathbf{b}$ be the solution. Then $\mathbf{b}$ satisfies the least squares normal equations,

$$\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{X}'\mathbf{y}. \tag{3-5}$$

If the inverse of $\mathbf{X}'\mathbf{X}$ exists, which follows from the full rank assumption (Assumption A2 in Section 2.3), then the solution is

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{3-6}$$

For this solution to minimize the sum of squares,

$$\frac{\partial^2 S(\mathbf{b})}{\partial \mathbf{b}\,\partial \mathbf{b}'} = 2\mathbf{X}'\mathbf{X}$$

must be a positive definite matrix. Let $q = \mathbf{c}'\mathbf{X}'\mathbf{X}\mathbf{c}$ for some arbitrary nonzero vector $\mathbf{c}$. Then

$$q = \mathbf{v}'\mathbf{v} = \sum_{i=1}^{n} v_i^2, \quad \text{where } \mathbf{v} = \mathbf{X}\mathbf{c}.$$

Unless every element of $\mathbf{v}$ is zero, $q$ is positive. But if $\mathbf{v}$ could be zero, then $\mathbf{v}$ would be a linear combination of the columns of $\mathbf{X}$ that equals $\mathbf{0}$, which contradicts the assumption that $\mathbf{X}$ has full rank. Since $\mathbf{c}$ is arbitrary, $q$ is positive for every nonzero $\mathbf{c}$, which establishes that $2\mathbf{X}'\mathbf{X}$ is positive definite. Therefore, if $\mathbf{X}$ has full rank, then the least squares solution $\mathbf{b}$ is unique and minimizes the sum of squared residuals.

3.2.2 APPLICATION: AN INVESTMENT EQUATION

To illustrate the computations in a multiple regression, we consider an example based on the macroeconomic data in Data Table F3.1. To estimate an investment equation, we first convert the investment and GNP series in Table F3.1 to real terms by dividing them by the CPI, and then scale the two series so that they are measured in trillions of dollars. The other variables in the regression are a time trend (1, 2, . . .), an interest rate, and the rate of inflation computed as the percentage change in the CPI. These produce the data matrices listed in Table 3.1. Consider first a regression of real investment on a constant, the time trend, and real GNP, which correspond to $x_1$, $x_2$, and $x_3$. (For reasons to be discussed in Chapter 20, this is probably not a well specified equation for these macroeconomic variables. It will suffice for a simple numerical example, however.)
Inserting the specific variables of the example, we have

$$\begin{aligned}
b_1 n + b_2 \sum_i T_i + b_3 \sum_i G_i &= \sum_i Y_i,\\
b_1 \sum_i T_i + b_2 \sum_i T_i^2 + b_3 \sum_i T_i G_i &= \sum_i T_i Y_i,\\
b_1 \sum_i G_i + b_2 \sum_i T_i G_i + b_3 \sum_i G_i^2 &= \sum_i G_i Y_i.
\end{aligned}$$

A solution can be obtained by first dividing the first equation by $n$ and rearranging it to obtain

$$b_1 = \bar{Y} - b_2 \bar{T} - b_3 \bar{G} = 0.20333 - b_2 \times 8 - b_3 \times 1.2873. \tag{3-7}$$
  • 47. TABLE 3.1 Data Matrices
(The first column is $\mathbf{y}$; the remaining five columns form $\mathbf{X}$.)

Real Investment   Constant   Trend   Real GNP   Interest Rate   Inflation Rate
      (Y)           (1)       (T)      (G)          (R)              (P)
     0.161           1         1      1.058         5.16             4.40
     0.172           1         2      1.088         5.87             5.15
     0.158           1         3      1.086         5.95             5.37
     0.173           1         4      1.122         4.88             4.99
     0.195           1         5      1.186         4.50             4.16
     0.217           1         6      1.254         6.44             5.75
     0.199           1         7      1.246         7.83             8.82
     0.163           1         8      1.232         6.25             9.31
     0.195           1         9      1.298         5.50             5.21
     0.231           1        10      1.370         5.46             5.83
     0.257           1        11      1.439         7.46             7.40
     0.259           1        12      1.479        10.28             8.64
     0.225           1        13      1.474        11.77             9.31
     0.241           1        14      1.503        13.42             9.44
     0.204           1        15      1.475        11.02             5.99

Note: Subsequent results are based on these values. Slightly different results are obtained if the raw data in Table F3.1 are input to the computer program and transformed internally.

Insert this solution in the second and third equations, and rearrange terms again to yield a set of two equations:

$$\begin{aligned}
b_2 \sum_i (T_i - \bar{T})^2 + b_3 \sum_i (T_i - \bar{T})(G_i - \bar{G}) &= \sum_i (T_i - \bar{T})(Y_i - \bar{Y}),\\
b_2 \sum_i (T_i - \bar{T})(G_i - \bar{G}) + b_3 \sum_i (G_i - \bar{G})^2 &= \sum_i (G_i - \bar{G})(Y_i - \bar{Y}).
\end{aligned} \tag{3-8}$$

This result shows the nature of the solution for the slopes, which can be computed from the sums of squares and cross products of the deviations of the variables. Letting lowercase letters indicate variables measured as deviations from the sample means, we find that the least squares solutions for $b_2$ and $b_3$ are

$$b_2 = \frac{\sum_i t_i y_i \sum_i g_i^2 - \sum_i g_i y_i \sum_i t_i g_i}{\sum_i t_i^2 \sum_i g_i^2 - \left(\sum_i g_i t_i\right)^2} = \frac{1.6040(0.359609) - 0.066196(9.82)}{280(0.359609) - (9.82)^2} = -0.0171984,$$

$$b_3 = \frac{\sum_i g_i y_i \sum_i t_i^2 - \sum_i t_i y_i \sum_i t_i g_i}{\sum_i t_i^2 \sum_i g_i^2 - \left(\sum_i g_i t_i\right)^2} = \frac{0.066196(280) - 1.6040(9.82)}{280(0.359609) - (9.82)^2} = 0.653723.$$

With these solutions in hand, the intercept can now be computed using (3-7); $b_1 = -0.500639$.

Suppose that we just regressed investment on the constant and GNP, omitting the time trend. At least some of the correlation we observe in the data will be explainable because both investment and real GNP have an obvious time trend.
Consider how this shows up in the regression computation. Denoting by "$b_{yx}$" the slope in the simple, bivariate regression of variable $y$ on a constant and the variable $x$, we find that the slope in this reduced regression would be

$$b_{yg} = \frac{\sum_i g_i y_i}{\sum_i g_i^2} = 0.184078. \tag{3-9}$$
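The hand computation above can be checked numerically. The following is a minimal sketch, assuming NumPy and the rounded values of Y and G from Table 3.1; the variable names are illustrative and not from the text. It solves the normal equations (3-5) for the three-variable regression and computes the simple slope in (3-9).

```python
import numpy as np

# Real investment (Y) and real GNP (G) as printed in Table 3.1; T is the trend 1..15.
Y = np.array([0.161, 0.172, 0.158, 0.173, 0.195, 0.217, 0.199, 0.163,
              0.195, 0.231, 0.257, 0.259, 0.225, 0.241, 0.204])
G = np.array([1.058, 1.088, 1.086, 1.122, 1.186, 1.254, 1.246, 1.232,
              1.298, 1.370, 1.439, 1.479, 1.474, 1.503, 1.475])
T = np.arange(1, 16, dtype=float)

# Multiple regression of Y on a constant, T, and G: solve X'X b = X'y, as in (3-5).
X = np.column_stack([np.ones_like(T), T, G])
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Simple regression slope of Y on G, as in (3-9), using deviations from means.
g = G - G.mean()
y = Y - Y.mean()
b_yg = (g @ y) / (g @ g)

print("b1, b2, b3 =", np.round(b, 6))
print("b_yg       =", round(float(b_yg), 6))
```

The printed values should agree closely with the figures reported in the text for $b_1$, $b_2$, $b_3$, and $b_{yg}$. Solving the linear system directly, rather than forming the inverse in (3-6), is a common numerical choice and not something the text prescribes.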
  • 48. Now divide both the numerator and denominator in the expression for $b_3$ by $\sum_i t_i^2 \sum_i g_i^2$. By manipulating it a bit and using the definition of the squared sample correlation between $G$ and $T$, $r_{gt}^2 = \left(\sum_i g_i t_i\right)^2 / \left(\sum_i g_i^2 \sum_i t_i^2\right)$, and defining $b_{yt}$ and $b_{tg}$ likewise, we obtain

$$b_{yg\cdot t} = \frac{b_{yg}}{1 - r_{gt}^2} - \frac{b_{yt}\, b_{tg}}{1 - r_{gt}^2} = 0.653723. \tag{3-10}$$

(The notation "$b_{yg\cdot t}$" used on the left-hand side is interpreted to mean the slope in the regression of $y$ on $g$ "in the presence of $t$.") The slope in the multiple regression differs from that in the simple regression by including a correction that accounts for the influence of the additional variable $t$ on both $Y$ and $G$. For a striking example of this effect, in the simple regression of real investment on a time trend, $b_{yt} = 1.604/280 = 0.0057286$, a positive number that reflects the upward trend apparent in the data. But, in the multiple regression, after we account for the influence of GNP on real investment, the slope on the time trend is $-0.0171984$, indicating instead a downward trend. The general result for a three-variable regression in which $x_1$ is a constant term is

$$b_{y2\cdot 3} = \frac{b_{y2} - b_{y3}\, b_{32}}{1 - r_{23}^2}. \tag{3-11}$$

It is clear from this expression that the magnitudes of $b_{y2\cdot 3}$ and $b_{y2}$ can be quite different. They need not even have the same sign.

As a final observation, note what becomes of $b_{yg\cdot t}$ in (3-10) if $r_{gt}^2$ equals zero. The first term becomes $b_{yg}$, whereas the second becomes zero. (If $G$ and $T$ are not correlated, then the slope in the regression of $T$ on $G$, $b_{tg}$, is zero.) Therefore, we conclude the following.

THEOREM 3.1 Orthogonal Regression
If the variables in a multiple regression are not correlated (i.e., are orthogonal), then the multiple regression slopes are the same as the slopes in the individual simple regressions.

In practice, you will never actually compute a multiple regression by hand or with a calculator.
For a regression with more than three variables, the tools of matrix algebra are indispensable (as is a computer). Consider, for example, an enlarged model of investment that includes, in addition to the constant, time trend, and GNP, an interest rate and the rate of inflation. Least squares requires the simultaneous solution of five normal equations. Letting $\mathbf{X}$ and $\mathbf{y}$ denote the full data matrices shown previously, the normal equations in (3-5) are

$$\begin{bmatrix}
15.000 & 120.00 & 19.310 & 111.79 & 99.770\\
120.00 & 1240.0 & 164.30 & 1035.9 & 875.60\\
19.310 & 164.30 & 25.218 & 148.98 & 131.22\\
111.79 & 1035.9 & 148.98 & 953.86 & 799.02\\
99.770 & 875.60 & 131.22 & 799.02 & 716.67
\end{bmatrix}
\begin{bmatrix} b_1\\ b_2\\ b_3\\ b_4\\ b_5 \end{bmatrix}
=
\begin{bmatrix} 3.0500\\ 26.004\\ 3.9926\\ 23.521\\ 20.732 \end{bmatrix}.$$
  • 49. The solution is

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (-0.50907,\ -0.01658,\ 0.67038,\ -0.002326,\ -0.00009401)'.$$

3.2.3 ALGEBRAIC ASPECTS OF THE LEAST SQUARES SOLUTION

The normal equations are

$$\mathbf{X}'\mathbf{X}\mathbf{b} - \mathbf{X}'\mathbf{y} = -\mathbf{X}'(\mathbf{y} - \mathbf{X}\mathbf{b}) = -\mathbf{X}'\mathbf{e} = \mathbf{0}. \tag{3-12}$$

Hence, for every column $\mathbf{x}_k$ of $\mathbf{X}$, $\mathbf{x}_k'\mathbf{e} = 0$. If the first column of $\mathbf{X}$ is a column of 1s, then there are three implications.

1. The least squares residuals sum to zero. This implication follows from $\mathbf{x}_1'\mathbf{e} = \mathbf{i}'\mathbf{e} = \sum_i e_i = 0$.
2. The regression hyperplane passes through the point of means of the data. The first normal equation implies that $\bar{y} = \bar{\mathbf{x}}'\mathbf{b}$.
3. The mean of the fitted values from the regression equals the mean of the actual values. This implication follows from point 1 because the fitted values are just $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$.

It is important to note that none of these results need hold if the regression does not contain a constant term.

3.2.4 PROJECTION

The vector of least squares residuals is

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{b}. \tag{3-13}$$

Inserting the result in (3-6) for $\mathbf{b}$ gives

$$\mathbf{e} = \mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}')\mathbf{y} = \mathbf{M}\mathbf{y}. \tag{3-14}$$

The $n \times n$ matrix $\mathbf{M}$ defined in (3-14) is fundamental in regression analysis. You can easily show that $\mathbf{M}$ is both symmetric ($\mathbf{M} = \mathbf{M}'$) and idempotent ($\mathbf{M} = \mathbf{M}^2$). In view of (3-13), we can interpret $\mathbf{M}$ as a matrix that produces the vector of least squares residuals in the regression of $\mathbf{y}$ on $\mathbf{X}$ when it premultiplies any vector $\mathbf{y}$. (It will be convenient later on to refer to this matrix as a "residual maker.") It follows that

$$\mathbf{M}\mathbf{X} = \mathbf{0}. \tag{3-15}$$

One way to interpret this result is that if $\mathbf{X}$ is regressed on $\mathbf{X}$, a perfect fit will result and the residuals will be zero.

Finally, (3-13) implies that $\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$, which is the sample analog to (2-3). (See Figure 3.1 as well.) The least squares results partition $\mathbf{y}$ into two parts, the fitted values $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$ and the residuals $\mathbf{e}$. [See Section A.3.7, especially (A-54).] Since $\mathbf{M}\mathbf{X} = \mathbf{0}$, these two parts are orthogonal.
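Now, given (3-13),

$$\hat{\mathbf{y}} = \mathbf{y} - \mathbf{e} = (\mathbf{I} - \mathbf{M})\mathbf{y} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{P}\mathbf{y}. \tag{3-16}$$

The matrix $\mathbf{P}$, which is also symmetric and idempotent, is a projection matrix. It is the matrix formed from $\mathbf{X}$ such that when a vector $\mathbf{y}$ is premultiplied by $\mathbf{P}$, the result is the fitted values in the least squares regression of $\mathbf{y}$ on $\mathbf{X}$. This is also the projection of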
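  • 50. the vector $\mathbf{y}$ into the column space of $\mathbf{X}$. (See Sections A3.5 and A3.7.) By multiplying it out, you will find that, like $\mathbf{M}$, $\mathbf{P}$ is symmetric and idempotent. Given the earlier results, it also follows that $\mathbf{M}$ and $\mathbf{P}$ are orthogonal;

$$\mathbf{P}\mathbf{M} = \mathbf{M}\mathbf{P} = \mathbf{0}.$$

Finally, as might be expected from (3-15),

$$\mathbf{P}\mathbf{X} = \mathbf{X}.$$

As a consequence of (3-15) and (3-16), we can see that least squares partitions the vector $\mathbf{y}$ into two orthogonal parts,

$$\mathbf{y} = \mathbf{P}\mathbf{y} + \mathbf{M}\mathbf{y} = \text{projection} + \text{residual}.$$

The result is illustrated in Figure 3.2 for the two variable case. The gray shaded plane is the column space of $\mathbf{X}$. The projection and residual are the orthogonal dotted rays. We can also see the Pythagorean theorem at work in the sums of squares,

$$\mathbf{y}'\mathbf{y} = \mathbf{y}'\mathbf{P}'\mathbf{P}\mathbf{y} + \mathbf{y}'\mathbf{M}'\mathbf{M}\mathbf{y} = \hat{\mathbf{y}}'\hat{\mathbf{y}} + \mathbf{e}'\mathbf{e}.$$

In manipulating equations involving least squares results, the following equivalent expressions for the sum of squared residuals are often useful:

$$\mathbf{e}'\mathbf{e} = \mathbf{y}'\mathbf{M}'\mathbf{M}\mathbf{y} = \mathbf{y}'\mathbf{M}\mathbf{y} = \mathbf{y}'\mathbf{e} = \mathbf{e}'\mathbf{y},$$

$$\mathbf{e}'\mathbf{e} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\mathbf{b}.$$

FIGURE 3.2 Projection of $\mathbf{y}$ into the column space of $\mathbf{X}$. [Figure: the vector $\mathbf{y}$, its projection $\hat{\mathbf{y}}$ in the plane spanned by $\mathbf{x}_1$ and $\mathbf{x}_2$, and the residual $\mathbf{e}$.]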
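The algebraic properties of $\mathbf{M}$ and $\mathbf{P}$ listed in Sections 3.2.3 and 3.2.4 are easy to confirm numerically. The sketch below assumes NumPy and uses small synthetic data generated only for illustration (the seed, dimensions, and names are arbitrary choices, not from the text).

```python
import numpy as np

rng = np.random.default_rng(0)           # synthetic data, for illustration only
n, K = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # includes a constant
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)    # projection matrix P = X (X'X)^{-1} X'
M = np.eye(n) - P                        # residual maker   M = I - P

y_hat, e = P @ y, M @ y                  # fitted values and residuals

checks = {
    "M symmetric": np.allclose(M, M.T),
    "M idempotent": np.allclose(M @ M, M),
    "MX = 0 (3-15)": np.allclose(M @ X, 0.0),
    "PX = X": np.allclose(P @ X, X),
    "PM = 0": np.allclose(P @ M, 0.0),
    "y'y = yhat'yhat + e'e": np.isclose(y @ y, y_hat @ y_hat + e @ e),
    "residuals sum to zero": np.isclose(e.sum(), 0.0),
    "mean(yhat) = mean(y)": np.isclose(y_hat.mean(), y.mean()),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Forming the full $n \times n$ matrices is fine for a demonstration of this size; for large $n$ one would normally compute residuals from the fitted coefficients instead.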
  • 51. 3.3 PARTITIONED REGRESSION AND PARTIAL REGRESSION

It is common to specify a multiple regression model when, in fact, interest centers on only one or a subset of the full set of variables. Consider the earnings equation discussed in Example 2.2. Although we are primarily interested in the association of earnings and education, age is, of necessity, included in the model. The question we consider here is what computations are involved in obtaining, in isolation, the coefficients of a subset of the variables in a multiple regression (for example, the coefficient of education in the aforementioned regression).

Suppose that the regression involves two sets of variables $\mathbf{X}_1$ and $\mathbf{X}_2$. Thus,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}.$$

What is the algebraic solution for $\mathbf{b}_2$? The normal equations are

$$\begin{matrix}(1)\\(2)\end{matrix}\quad
\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2\\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}
\begin{bmatrix} \mathbf{b}_1\\ \mathbf{b}_2 \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}_1'\mathbf{y}\\ \mathbf{X}_2'\mathbf{y} \end{bmatrix}. \tag{3-17}$$

A solution can be obtained by using the partitioned inverse matrix of (A-74). Alternatively, (1) and (2) in (3-17) can be manipulated directly to solve for $\mathbf{b}_2$. We first solve (1) for $\mathbf{b}_1$:

$$\mathbf{b}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} - (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\mathbf{b}_2 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'(\mathbf{y} - \mathbf{X}_2\mathbf{b}_2). \tag{3-18}$$

This solution states that $\mathbf{b}_1$ is the set of coefficients in the regression of $\mathbf{y}$ on $\mathbf{X}_1$, minus a correction vector. We digress briefly to examine an important result embedded in (3-18). Suppose that $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{0}$. Then, $\mathbf{b}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$, which is simply the coefficient vector in the regression of $\mathbf{y}$ on $\mathbf{X}_1$. The general result, which we have just proved, is the following theorem.

THEOREM 3.2 Orthogonal Partitioned Regression
In the multiple linear least squares regression of $\mathbf{y}$ on two sets of variables $\mathbf{X}_1$ and $\mathbf{X}_2$, if the two sets of variables are orthogonal, then the separate coefficient vectors can be obtained by separate regressions of $\mathbf{y}$ on $\mathbf{X}_1$ alone and $\mathbf{y}$ on $\mathbf{X}_2$ alone.

Note that Theorem 3.2 encompasses Theorem 3.1.
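Theorem 3.2 can be illustrated with a small numerical experiment. The sketch below assumes NumPy and synthetic data; to obtain an $\mathbf{X}_2$ that is orthogonal to $\mathbf{X}_1$, it subtracts from an arbitrary matrix `Z` (a hypothetical name) its projection on $\mathbf{X}_1$. This construction is an illustrative device, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)           # synthetic data, for illustration only
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 2))

def ols(X, y):
    """Least squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Force X1'X2 = 0 by removing from Z its projection on the columns of X1.
X2 = Z - X1 @ ols(X1, Z)
y = rng.normal(size=n)

b_joint = ols(np.column_stack([X1, X2]), y)   # regression on [X1, X2]
b1_alone = ols(X1, y)                          # regression on X1 only
b2_alone = ols(X2, y)                          # regression on X2 only

print("X1'X2 is zero:      ", np.allclose(X1.T @ X2, 0.0, atol=1e-10))
print("joint = separate:   ", np.allclose(b_joint, np.concatenate([b1_alone, b2_alone])))
```

If `X2` is replaced by the unadjusted `Z`, the two sets of coefficients will generally differ, which is the correction term in (3-18) at work.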
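Now, inserting (3-18) in equation (2) of (3-17) produces

$$\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\mathbf{b}_2 + \mathbf{X}_2'\mathbf{X}_2\mathbf{b}_2 = \mathbf{X}_2'\mathbf{y}.$$

After collecting terms, the solution is

$$\begin{aligned}
\mathbf{b}_2 &= \left[\mathbf{X}_2'(\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1')\mathbf{X}_2\right]^{-1}\left[\mathbf{X}_2'(\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1')\mathbf{y}\right]\\
&= (\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2)^{-1}(\mathbf{X}_2'\mathbf{M}_1\mathbf{y}).
\end{aligned} \tag{3-19}$$

The matrix appearing in the parentheses inside each set of square brackets is the "residual maker" defined in (3-14), in this case defined for a regression on the columns of $\mathbf{X}_1$.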
  • 52. Thus, $\mathbf{M}_1\mathbf{X}_2$ is a matrix of residuals; each column of $\mathbf{M}_1\mathbf{X}_2$ is a vector of residuals in the regression of the corresponding column of $\mathbf{X}_2$ on the variables in $\mathbf{X}_1$. By exploiting the fact that $\mathbf{M}_1$, like $\mathbf{M}$, is idempotent, we can rewrite (3-19) as

$$\mathbf{b}_2 = (\mathbf{X}_2^{*\prime}\mathbf{X}_2^{*})^{-1}\mathbf{X}_2^{*\prime}\mathbf{y}^{*}, \tag{3-20}$$

where

$$\mathbf{X}_2^{*} = \mathbf{M}_1\mathbf{X}_2 \quad \text{and} \quad \mathbf{y}^{*} = \mathbf{M}_1\mathbf{y}.$$

This result is fundamental in regression analysis.

THEOREM 3.3 Frisch–Waugh Theorem
In the linear least squares regression of vector $\mathbf{y}$ on two sets of variables, $\mathbf{X}_1$ and $\mathbf{X}_2$, the subvector $\mathbf{b}_2$ is the set of coefficients obtained when the residuals from a regression of $\mathbf{y}$ on $\mathbf{X}_1$ alone are regressed on the set of residuals obtained when each column of $\mathbf{X}_2$ is regressed on $\mathbf{X}_1$.

This process is commonly called partialing out or netting out the effect of $\mathbf{X}_1$. For this reason, the coefficients in a multiple regression are often called the partial regression coefficients. The application of this theorem to the computation of a single coefficient as suggested at the beginning of this section is detailed in the following: Consider the regression of $\mathbf{y}$ on a set of variables $\mathbf{X}$ and an additional variable $\mathbf{z}$. Denote the coefficients $\mathbf{b}$ and $c$.

COROLLARY 3.3.1 Individual Regression Coefficients
The coefficient on $\mathbf{z}$ in a multiple regression of $\mathbf{y}$ on $\mathbf{W} = [\mathbf{X}, \mathbf{z}]$ is computed as $c = (\mathbf{z}'\mathbf{M}\mathbf{z})^{-1}(\mathbf{z}'\mathbf{M}\mathbf{y}) = (\mathbf{z}^{*\prime}\mathbf{z}^{*})^{-1}\mathbf{z}^{*\prime}\mathbf{y}^{*}$, where $\mathbf{z}^{*}$ and $\mathbf{y}^{*}$ are the residual vectors from least squares regressions of $\mathbf{z}$ and $\mathbf{y}$ on $\mathbf{X}$; $\mathbf{z}^{*} = \mathbf{M}\mathbf{z}$ and $\mathbf{y}^{*} = \mathbf{M}\mathbf{y}$, where $\mathbf{M}$ is defined in (3-14).

In terms of Example 2.2, we could obtain the coefficient on education in the multiple regression by first regressing earnings and education on age (or age and age squared) and then using the residuals from these regressions in a simple regression.
In a classic application of this latter observation, Frisch and Waugh (1933) (who are credited with the result) noted that in a time-series setting, the same results were obtained whether a regression was fitted with a time-trend variable or the data were first "detrended" by netting out the effect of time, as noted earlier, and using just the detrended data in a simple regression.²

² Recall our earlier investment example.
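The detrending interpretation can be tried on the investment example. The sketch below assumes NumPy and the rounded Y and G values printed in Table 3.1; the helper name `residuals` is hypothetical. It nets the constant and time trend out of both investment and GNP, regresses one set of residuals on the other as in (3-20), and compares the result with the GNP coefficient from the full three-variable regression.

```python
import numpy as np

# Real investment (Y) and real GNP (G) as printed in Table 3.1; T is the trend 1..15.
Y = np.array([0.161, 0.172, 0.158, 0.173, 0.195, 0.217, 0.199, 0.163,
              0.195, 0.231, 0.257, 0.259, 0.225, 0.241, 0.204])
G = np.array([1.058, 1.088, 1.086, 1.122, 1.186, 1.254, 1.246, 1.232,
              1.298, 1.370, 1.439, 1.479, 1.474, 1.503, 1.475])
T = np.arange(1, 16, dtype=float)
X1 = np.column_stack([np.ones_like(T), T])     # constant and time trend

def residuals(X, v):
    """Residuals from the least squares regression of v on the columns of X (M v)."""
    return v - X @ np.linalg.lstsq(X, v, rcond=None)[0]

y_star = residuals(X1, Y)                      # detrended investment, y* = M1 y
g_star = residuals(X1, G)                      # detrended GNP,        X2* = M1 X2

# Simple regression of detrended investment on detrended GNP, as in (3-20).
b_partial = (g_star @ y_star) / (g_star @ g_star)

# Coefficient on G in the full regression on [1, T, G].
b_full = np.linalg.lstsq(np.column_stack([X1, G]), Y, rcond=None)[0]

print("detrended slope:     ", round(float(b_partial), 6))
print("full-regression b3:  ", round(float(b_full[2]), 6))
```

The two numbers should coincide up to rounding error, and both should be close to the value of $b_3$ reported in Section 3.2.2.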
  • 53. As an application of these results, consider the case in which $\mathbf{X}_1$ is $\mathbf{i}$, a column of 1s in the first column of $\mathbf{X}$. The solution for $\mathbf{b}_2$ in this case will then be the slopes in a regression with a constant term. The coefficient in a regression of any variable $\mathbf{z}$ on $\mathbf{i}$ is $[\mathbf{i}'\mathbf{i}]^{-1}\mathbf{i}'\mathbf{z} = \bar{z}$, the fitted values are $\mathbf{i}\bar{z}$, and the residuals are $z_i - \bar{z}$. When we apply this to our previous results, we find the following.

COROLLARY 3.3.2 Regression with a Constant Term
The slopes in a multiple regression that contains a constant term are obtained by transforming the data to deviations from their means and then regressing the variable $y$ in deviation form on the explanatory variables, also in deviation form.

[We used this result in (3-8).] Having obtained the coefficients on $\mathbf{X}_2$, how can we recover the coefficients on $\mathbf{X}_1$ (the constant term)? One way is to repeat the exercise while reversing the roles of $\mathbf{X}_1$ and $\mathbf{X}_2$. But there is an easier way. We have already solved for $\mathbf{b}_2$. Therefore, we can use (3-18) in a solution for $\mathbf{b}_1$. If $\mathbf{X}_1$ is just a column of 1s, then the first of these produces the familiar result

$$b_1 = \bar{y} - \bar{x}_2 b_2 - \cdots - \bar{x}_K b_K \tag{3-21}$$

[which is used in (3-7)].

3.4 PARTIAL REGRESSION AND PARTIAL CORRELATION COEFFICIENTS

The use of multiple regression involves a conceptual experiment that we might not be able to carry out in practice, the ceteris paribus analysis familiar in economics. To pursue Example 2.2, a regression equation relating earnings to age and education enables us to do the conceptual experiment of comparing the earnings of two individuals of the same age with different education levels, even if the sample contains no such pair of individuals. It is this characteristic of the regression that is implied by the term partial regression coefficients.
The way we obtain this result, as we have seen, is first to regress income and education on age and then to compute the residuals from this regression. By construction, age will not have any power in explaining variation in these residuals. Therefore, any correlation between income and education after this "purging" is independent of (or after removing the effect of) age.

The same principle can be applied to the correlation between two variables. To continue our example, to what extent can we assert that this correlation reflects a direct relationship rather than that both income and education tend, on average, to rise as individuals become older? To find out, we would use a partial correlation coefficient, which is computed along the same lines as the partial regression coefficient. In the context of our example, the partial correlation coefficient between income and education,
  • 54. controlling for the effect of age, is obtained as follows:

1. $\mathbf{y}^{*}$ = the residuals in a regression of income on a constant and age.
2. $\mathbf{z}^{*}$ = the residuals in a regression of education on a constant and age.
3. The partial correlation $r_{yz}^{*}$ is the simple correlation between $\mathbf{y}^{*}$ and $\mathbf{z}^{*}$.

This calculation might seem to require a formidable amount of computation. There is, however, a convenient shortcut. Once the multiple regression is computed, the t ratio in (4-13) and (4-14) for testing the hypothesis that the coefficient equals zero (e.g., the last column of Table 4.2) can be used to compute

$$r_{yz}^{*2} = \frac{t_z^2}{t_z^2 + \text{degrees of freedom}}. \tag{3-22}$$

The proof of this less than perfectly intuitive result will be useful to illustrate some results on partitioned regression and to put into context two very useful results from least squares algebra. As in Corollary 3.3.1, let $\mathbf{W}$ denote the $n \times (K + 1)$ regressor matrix $[\mathbf{X}, \mathbf{z}]$ and let $\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. We assume that there is a constant term in $\mathbf{X}$, so that the vectors of residuals $\mathbf{y}^{*} = \mathbf{M}\mathbf{y}$ and $\mathbf{z}^{*} = \mathbf{M}\mathbf{z}$ will have zero sample means. The squared partial correlation is

$$r_{yz}^{*2} = \frac{(\mathbf{z}^{*\prime}\mathbf{y}^{*})^2}{(\mathbf{z}^{*\prime}\mathbf{z}^{*})(\mathbf{y}^{*\prime}\mathbf{y}^{*})}.$$

Let $c$ and $\mathbf{u}$ denote the coefficient on $\mathbf{z}$ and the vector of residuals in the multiple regression of $\mathbf{y}$ on $\mathbf{W}$. The squared t ratio in (3-22) is

$$t_z^2 = \frac{c^2}{\left[\dfrac{\mathbf{u}'\mathbf{u}}{n - (K + 1)}\right](\mathbf{W}'\mathbf{W})^{-1}_{K+1,K+1}},$$

where $(\mathbf{W}'\mathbf{W})^{-1}_{K+1,K+1}$ is the $(K + 1)$st (last) diagonal element of $(\mathbf{W}'\mathbf{W})^{-1}$. The partitioned inverse formula in (A-74) can be applied to the matrix $[\mathbf{X}, \mathbf{z}]'[\mathbf{X}, \mathbf{z}]$. This matrix appears in (3-17), with $\mathbf{X}_1 = \mathbf{X}$ and $\mathbf{X}_2 = \mathbf{z}$. The result is the inverse matrix that appears in (3-19) and (3-20), which implies the first important result.

THEOREM 3.4 Diagonal Elements of the Inverse of a Moment Matrix
If $\mathbf{W} = [\mathbf{X}, \mathbf{z}]$, then the last diagonal element of $(\mathbf{W}'\mathbf{W})^{-1}$ is $(\mathbf{z}'\mathbf{M}\mathbf{z})^{-1} = (\mathbf{z}^{*\prime}\mathbf{z}^{*})^{-1}$, where $\mathbf{z}^{*} = \mathbf{M}\mathbf{z}$ and $\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.
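Both Theorem 3.4 and the shortcut in (3-22) can be checked on synthetic data. The sketch below assumes NumPy; the data-generating coefficients and the seed are arbitrary illustrative choices, and the squared t ratio is computed from the expression given above rather than from any library routine.

```python
import numpy as np

rng = np.random.default_rng(2)            # synthetic data, for illustration only
n, K = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
z = rng.normal(size=n) + X[:, 1]          # z is correlated with a column of X
y = X @ np.array([1.0, 0.5, -0.3]) + 0.4 * z + rng.normal(size=n)

# Residual maker for X, and the residualized vectors z* = Mz, y* = My.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
z_star, y_star = M @ z, M @ y

# Long regression of y on W = [X, z].
W = np.column_stack([X, z])
WtW_inv = np.linalg.inv(W.T @ W)
coef = WtW_inv @ W.T @ y
c = coef[-1]                              # coefficient on z
u = y - W @ coef                          # residuals from the long regression
df = n - (K + 1)                          # degrees of freedom

# Squared t ratio on z and squared partial correlation between y and z.
t_sq = c**2 / ((u @ u) / df * WtW_inv[-1, -1])
r_sq_partial = (z_star @ y_star) ** 2 / ((z_star @ z_star) * (y_star @ y_star))

print("Theorem 3.4 holds:", np.isclose(WtW_inv[-1, -1], 1.0 / (z_star @ z_star)))
print("(3-22) holds:     ", np.isclose(r_sq_partial, t_sq / (t_sq + df)))
```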
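(Note that this result generalizes the development in Section A.2.8 where $\mathbf{X}$ is only the constant term.) If we now use Corollary 3.3.1 and Theorem 3.4 for $c$, after some manipulation, we obtain

$$\frac{t_z^2}{t_z^2 + [n - (K + 1)]} = \frac{(\mathbf{z}^{*\prime}\mathbf{y}^{*})^2}{(\mathbf{z}^{*\prime}\mathbf{y}^{*})^2 + (\mathbf{u}'\mathbf{u})(\mathbf{z}^{*\prime}\mathbf{z}^{*})} = \frac{r_{yz}^{*2}}{r_{yz}^{*2} + (\mathbf{u}'\mathbf{u})/(\mathbf{y}^{*\prime}\mathbf{y}^{*})},$$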
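  • 55. where $\mathbf{u} = \mathbf{y} - \mathbf{X}\mathbf{d} - \mathbf{z}c$ is the vector of residuals when $\mathbf{y}$ is regressed on $\mathbf{X}$ and $\mathbf{z}$. Note that unless $\mathbf{X}'\mathbf{z} = \mathbf{0}$, $\mathbf{d}$ will not equal $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. (See Section 8.2.1.) Moreover, unless $c = 0$, $\mathbf{u}$ will not equal $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{b}$. Now we have shown in Corollary 3.3.1 that $c = (\mathbf{z}^{*\prime}\mathbf{z}^{*})^{-1}(\mathbf{z}^{*\prime}\mathbf{y}^{*})$. We also have, from (3-18), that the coefficients on $\mathbf{X}$ in the regression of $\mathbf{y}$ on $\mathbf{W} = [\mathbf{X}, \mathbf{z}]$ are

$$\mathbf{d} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{y} - \mathbf{z}c) = \mathbf{b} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{z}c.$$

So, inserting this expression for $\mathbf{d}$ in that for $\mathbf{u}$ gives

$$\mathbf{u} = \mathbf{y} - \mathbf{X}\mathbf{b} + \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{z}c - \mathbf{z}c = \mathbf{e} - \mathbf{M}\mathbf{z}c = \mathbf{e} - \mathbf{z}^{*}c.$$

Now

$$\mathbf{u}'\mathbf{u} = \mathbf{e}'\mathbf{e} + c^2(\mathbf{z}^{*\prime}\mathbf{z}^{*}) - 2c\,\mathbf{z}^{*\prime}\mathbf{e}.$$

But $\mathbf{e} = \mathbf{M}\mathbf{y} = \mathbf{y}^{*}$ and $\mathbf{z}^{*\prime}\mathbf{e} = \mathbf{z}^{*\prime}\mathbf{y}^{*} = c(\mathbf{z}^{*\prime}\mathbf{z}^{*})$. Inserting this in $\mathbf{u}'\mathbf{u}$ gives our second useful result.

THEOREM 3.5 Change in the Sum of Squares When a Variable Is Added to a Regression
If $\mathbf{e}'\mathbf{e}$ is the sum of squared residuals when $\mathbf{y}$ is regressed on $\mathbf{X}$ and $\mathbf{u}'\mathbf{u}$ is the sum of squared residuals when $\mathbf{y}$ is regressed on $\mathbf{X}$ and $\mathbf{z}$, then

$$\mathbf{u}'\mathbf{u} = \mathbf{e}'\mathbf{e} - c^2(\mathbf{z}^{*\prime}\mathbf{z}^{*}) \le \mathbf{e}'\mathbf{e}, \tag{3-23}$$

where $c$ is the coefficient on $\mathbf{z}$ in the long regression and $\mathbf{z}^{*} = [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{z}$ is the vector of residuals when $\mathbf{z}$ is regressed on $\mathbf{X}$.

Returning to our derivation, we note that $\mathbf{e}'\mathbf{e} = \mathbf{y}^{*\prime}\mathbf{y}^{*}$ and $c^2(\mathbf{z}^{*\prime}\mathbf{z}^{*}) = (\mathbf{z}^{*\prime}\mathbf{y}^{*})^2/(\mathbf{z}^{*\prime}\mathbf{z}^{*})$. Therefore, $(\mathbf{u}'\mathbf{u})/(\mathbf{y}^{*\prime}\mathbf{y}^{*}) = 1 - r_{yz}^{*2}$, and we have our result.

Example 3.1 Partial Correlations
For the data in the application in Section 3.2.2, the simple correlations between investment and the regressors, $r_{yk}$, and the partial correlations, $r_{yk}^{*}$, between investment and the four regressors (given the other variables) are listed in Table 3.2.

TABLE 3.2 Correlations of Investment with Other Variables

             Simple Correlation   Partial Correlation
Time              0.7496               −0.9360
GNP               0.8632                0.9680
Interest          0.5871               −0.5167
Inflation         0.4777               −0.0221

As is clear from the table, there is no necessary relation between the simple and partial correlation coefficients. One thing worth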
  • 56. noting is the signs of the coefficients. The signs of the partial correlation coefficients are the same as the signs of the respective regression coefficients, three of which are negative. All the simple correlation coefficients are positive because of the latent "effect" of time.

3.5 GOODNESS OF FIT AND THE ANALYSIS OF VARIANCE

The original fitting criterion, the sum of squared residuals, suggests a measure of the fit of the regression line to the data. However, as can easily be verified, the sum of squared residuals can be scaled arbitrarily just by multiplying all the values of $y$ by the desired scale factor. Since the fitted values of the regression are based on the values of $x$, we might ask instead whether variation in $x$ is a good predictor of variation in $y$. Figure 3.3 shows three possible cases for a simple linear regression model. The measure of fit described here embodies both the fitting criterion and the covariation of $y$ and $x$.

Variation of the dependent variable is defined in terms of deviations from its mean, $(y_i - \bar{y})$. The total variation in $y$ is the sum of squared deviations:

$$\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$

FIGURE 3.3 Sample Data. [Figure: three scatter plots of $y$ against $x$, with panels titled "No Fit," "Moderate Fit," and "No Fit."]
  • 57. FIGURE 3.4 Decomposition of $y_i$. [Figure: for a data point $(x_i, y_i)$, the deviation $y_i - \bar{y}$ is split into $\hat{y}_i - \bar{y} = b(x_i - \bar{x})$ and the residual $e_i = y_i - \hat{y}_i$.]

In terms of the regression equation, we may write the full set of observations as

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} = \hat{\mathbf{y}} + \mathbf{e}. \tag{3-24}$$

For an individual observation, we have

$$y_i = \hat{y}_i + e_i = \mathbf{x}_i'\mathbf{b} + e_i.$$

If the regression contains a constant term, then the residuals will sum to zero and the mean of the predicted values of $y_i$ will equal the mean of the actual values. Subtracting $\bar{y}$ from both sides and using this result and result 2 in Section 3.2.3 gives

$$y_i - \bar{y} = \hat{y}_i - \bar{y} + e_i = (\mathbf{x}_i - \bar{\mathbf{x}})'\mathbf{b} + e_i.$$

Figure 3.4 illustrates the computation for the two-variable regression. Intuitively, the regression would appear to fit well if the deviations of $y$ from its mean are more largely accounted for by deviations of $x$ from its mean than by the residuals. Since both terms in this decomposition sum to zero, to quantify this fit, we use the sums of squares instead. For the full set of observations, we have

$$\mathbf{M}^0\mathbf{y} = \mathbf{M}^0\mathbf{X}\mathbf{b} + \mathbf{M}^0\mathbf{e},$$

where $\mathbf{M}^0$ is the $n \times n$ idempotent matrix that transforms observations into deviations from sample means. (See Section A.2.8.) The column of $\mathbf{M}^0\mathbf{X}$ corresponding to the constant term is zero, and, since the residuals already have mean zero, $\mathbf{M}^0\mathbf{e} = \mathbf{e}$. Then, since $\mathbf{e}'\mathbf{M}^0\mathbf{X} = \mathbf{e}'\mathbf{X} = \mathbf{0}$, the total sum of squares is

$$\mathbf{y}'\mathbf{M}^0\mathbf{y} = \mathbf{b}'\mathbf{X}'\mathbf{M}^0\mathbf{X}\mathbf{b} + \mathbf{e}'\mathbf{e}.$$

Write this as total sum of squares = regression sum of squares + error sum of squares,
  • 58. or

$$\text{SST} = \text{SSR} + \text{SSE}. \tag{3-25}$$

(Note that this is precisely the partitioning that appears at the end of Section 3.2.4.)

We can now obtain a measure of how well the regression line fits the data by using the coefficient of determination:

$$\frac{\text{SSR}}{\text{SST}} = \frac{\mathbf{b}'\mathbf{X}'\mathbf{M}^0\mathbf{X}\mathbf{b}}{\mathbf{y}'\mathbf{M}^0\mathbf{y}} = 1 - \frac{\mathbf{e}'\mathbf{e}}{\mathbf{y}'\mathbf{M}^0\mathbf{y}}. \tag{3-26}$$

The coefficient of determination is denoted $R^2$. As we have shown, it must be between 0 and 1, and it measures the proportion of the total variation in $y$ that is accounted for by variation in the regressors. It equals zero if the regression is a horizontal line, that is, if all the elements of $\mathbf{b}$ except the constant term are zero. In this case, the predicted values of $y$ are always $\bar{y}$, so deviations of $x$ from its mean do not translate into different predictions for $y$. As such, $x$ has no explanatory power. The other extreme, $R^2 = 1$, occurs if the values of $x$ and $y$ all lie in the same hyperplane (on a straight line for a two variable regression) so that the residuals are all zero. If all the values of $y_i$ lie on a vertical line, then $R^2$ has no meaning and cannot be computed.

Regression analysis is often used for forecasting. In this case, we are interested in how well the regression model predicts movements in the dependent variable. With this in mind, an equivalent way to compute $R^2$ is also useful. First

$$\mathbf{b}'\mathbf{X}'\mathbf{M}^0\mathbf{X}\mathbf{b} = \hat{\mathbf{y}}'\mathbf{M}^0\hat{\mathbf{y}},$$

but $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, $\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e}$, $\mathbf{M}^0\mathbf{e} = \mathbf{e}$, and $\mathbf{X}'\mathbf{e} = \mathbf{0}$, so $\hat{\mathbf{y}}'\mathbf{M}^0\hat{\mathbf{y}} = \hat{\mathbf{y}}'\mathbf{M}^0\mathbf{y}$. Multiply $R^2 = \hat{\mathbf{y}}'\mathbf{M}^0\hat{\mathbf{y}}/\mathbf{y}'\mathbf{M}^0\mathbf{y} = \hat{\mathbf{y}}'\mathbf{M}^0\mathbf{y}/\mathbf{y}'\mathbf{M}^0\mathbf{y}$ by $1 = \hat{\mathbf{y}}'\mathbf{M}^0\mathbf{y}/\hat{\mathbf{y}}'\mathbf{M}^0\hat{\mathbf{y}}$ to obtain

$$R^2 = \frac{\left[\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right]^2}{\left[\sum_i (y_i - \bar{y})^2\right]\left[\sum_i (\hat{y}_i - \bar{\hat{y}})^2\right]}, \tag{3-27}$$

which is the squared correlation between the observed values of $y$ and the predictions produced by the estimated regression equation.

Example 3.2 Fit of a Consumption Function
The data plotted in Figure 2.1 are listed in Appendix Table F2.1.
For these data, where $y$ is $C$ and $x$ is $X$, we have $\bar{y} = 273.2727$, $\bar{x} = 323.2727$, $S_{yy} = 12{,}618.182$, $S_{xx} = 12{,}300.182$, $S_{xy} = 8{,}423.182$, so SST $= 12{,}618.182$, $b = 8{,}423.182/12{,}300.182 = 0.6848014$, SSR $= b^2 S_{xx} = 5{,}768.2068$, and SSE $=$ SST $-$ SSR $= 6{,}849.975$. Then $R^2 = b^2 S_{xx}/\text{SST} = 0.457135$. As can be seen in Figure 2.1, this is a moderate fit, although it is not particularly good for aggregate time-series data. On the other hand, it is clear that not accounting for the anomalous wartime data has degraded the fit of the model. This value is the $R^2$ for the model indicated by the dotted line in the figure. By simply omitting the years 1942–1945 from the sample and doing these computations with the remaining seven observations (the heavy solid line), we obtain an $R^2$ of 0.93697. Alternatively, by creating a variable WAR which equals 1 in the years 1942–1945 and zero otherwise and including this in the model, which produces the model shown by the two solid lines, the $R^2$ rises to 0.94639.

We can summarize the calculation of $R^2$ in an analysis of variance table, which might appear as shown in Table 3.3.

Example 3.3 Analysis of Variance for an Investment Equation
The analysis of variance table for the investment equation of Section 3.2.2 is given in Table 3.4.
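The arithmetic in Example 3.2 can be reproduced directly from the three summary statistics quoted there. The short sketch below uses plain Python; the variable names are illustrative, and only the quantities $S_{yy}$, $S_{xx}$, and $S_{xy}$ are taken from the text.

```python
# Summary statistics quoted in Example 3.2 (sums of squares and cross products
# of deviations from means for the consumption data).
S_yy = 12618.182
S_xx = 12300.182
S_xy = 8423.182

b = S_xy / S_xx          # slope of the simple regression of y on x
SST = S_yy               # total sum of squares
SSR = b**2 * S_xx        # regression sum of squares
SSE = SST - SSR          # error (residual) sum of squares
R2 = SSR / SST           # coefficient of determination, as in (3-26)

print(f"b   = {b:.7f}")
print(f"SSR = {SSR:.4f}")
print(f"SSE = {SSE:.3f}")
print(f"R^2 = {R2:.6f}")
```

The printed values should match the figures in the example up to rounding in the last digit. Note that $R^2$ can equivalently be written as $S_{xy}^2/(S_{xx} S_{yy})$, the squared simple correlation, which is the two-variable case of (3-27).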
  • 59.
TABLE 3.3 Analysis of Variance
Source        Sum of Squares               Degrees of Freedom                    Mean Square
Regression    $b'X'y - n\bar{y}^2$         K − 1 (assuming a constant term)
Residual      $e'e$                        n − K                                 $s^2$
Total         $y'y - n\bar{y}^2$           n − 1                                 $S_{yy}/(n-1) = s_y^2$
Coefficient of determination: $R^2 = 1 - e'e/(y'y - n\bar{y}^2)$

TABLE 3.4 Analysis of Variance for the Investment Equation
Source        Sum of Squares    Degrees of Freedom    Mean Square
Regression    0.0159025         4                     0.003976
Residual      0.0004508         10                    0.00004508
Total         0.016353          14                    0.0011681
$R^2$ = 0.0159025/0.016353 = 0.97245.

3.5.1 THE ADJUSTED R-SQUARED AND A MEASURE OF FIT
There are some problems with the use of $R^2$ in analyzing goodness of fit. The first concerns the number of degrees of freedom used up in estimating the parameters. $R^2$ will never decrease when another variable is added to a regression equation. Equation (3-23) provides a convenient means for us to establish this result. Once again, we are comparing a regression of y on X with sum of squared residuals $e'e$ to a regression of y on X and an additional variable z, which produces sum of squared residuals $u'u$. Recall the vectors of residuals $z_* = Mz$ and $y_* = My = e$, which implies that $e'e = (y_*'y_*)$. Let c be the coefficient on z in the longer regression. Then $c = (z_*'z_*)^{-1}(z_*'y_*)$, and inserting this in (3-23) produces
$$u'u = e'e - \frac{(z_*'y_*)^2}{(z_*'z_*)} = e'e\left(1 - r_{yz}^{*2}\right), \tag{3-28}$$
where $r_{yz}^{*}$ is the partial correlation between y and z, controlling for X. Now divide through both sides of the equality by $y'M^0y$. From (3-26), $u'u/y'M^0y$ is $(1 - R_{Xz}^2)$ for the regression on X and z and $e'e/y'M^0y$ is $(1 - R_X^2)$. Rearranging the result produces the following:

THEOREM 3.6 Change in $R^2$ When a Variable Is Added to a Regression
Let $R_{Xz}^2$ be the coefficient of determination in the regression of y on X and an additional variable z, let $R_X^2$ be the same for the regression of y on X alone, and let $r_{yz}^{*}$ be the partial correlation between y and z, controlling for X. Then
$$R_{Xz}^2 = R_X^2 + \left(1 - R_X^2\right) r_{yz}^{*2}. \tag{3-29}$$
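Theorem 3.6 can be checked numerically. The sketch below is an illustration under assumed, simulated data (the data generating values, the seed, and the helper names `solve`, `residuals`, and `dot` are all hypothetical). It computes $R^2$ for the short and long regressions directly, computes the partial correlation from the two residual vectors $y_* = My$ and $z_* = Mz$, and compares the two sides of (3-29).

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def residuals(cols, y):
    """Least squares residuals from regressing y on the listed columns."""
    k = len(cols)
    xtx = [[dot(cols[i], cols[j]) for j in range(k)] for i in range(k)]
    xty = [dot(cols[i], y) for i in range(k)]
    coef = solve(xtx, xty)
    return [yi - sum(coef[j] * cols[j][i] for j in range(k)) for i, yi in enumerate(y)]

# Simulated data; every number here is an assumption made for illustration.
random.seed(0)
n = 50
ones = [1.0] * n
x = [random.gauss(0.0, 1.0) for _ in range(n)]
z = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]
y = [1.0 + 0.8 * xi + 0.3 * zi + random.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)      # y'M0y

e = residuals([ones, x], y)                   # y* = My, residuals of y on X
z_star = residuals([ones, x], z)              # z* = Mz, residuals of z on X
u = residuals([ones, x, z], y)                # residuals of y on X and z

r2_x = 1.0 - dot(e, e) / sst
r2_xz = 1.0 - dot(u, u) / sst
partial_r_sq = dot(z_star, e) ** 2 / (dot(z_star, z_star) * dot(e, e))

rhs = r2_x + (1.0 - r2_x) * partial_r_sq      # right-hand side of (3-29)
print(r2_x, r2_xz, rhs)
```

Because the long regression is computed directly rather than through (3-28), agreement of `r2_xz` and `rhs` is a genuine check of the identity rather than a restatement of it.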
  • 60.
Thus, the $R^2$ in the longer regression cannot be smaller. It is tempting to exploit this result by just adding variables to the model; $R^2$ will continue to rise to its limit of 1.³ The adjusted $R^2$ (for degrees of freedom), which incorporates a penalty for these results, is computed as follows:⁴
$$\bar{R}^2 = 1 - \frac{e'e/(n-K)}{y'M^0y/(n-1)}. \tag{3-30}$$
For computational purposes, the connection between $R^2$ and $\bar{R}^2$ is
$$\bar{R}^2 = 1 - \frac{n-1}{n-K}\left(1 - R^2\right).$$
The adjusted $R^2$ may decline when a variable is added to the set of independent variables. Indeed, $\bar{R}^2$ may even be negative. To consider an admittedly extreme case, suppose that x and y have a sample correlation of zero. Then the adjusted $R^2$ will equal −1/(n − 2). (Thus, the name “adjusted R-squared” is a bit misleading; as can be seen in (3-30), $\bar{R}^2$ is not actually computed as the square of any quantity.) Whether $\bar{R}^2$ rises or falls depends on whether the contribution of the new variable to the fit of the regression more than offsets the correction for the loss of an additional degree of freedom. The general result (the proof of which is left as an exercise) is as follows.

THEOREM 3.7 Change in $\bar{R}^2$ When a Variable Is Added to a Regression
In a multiple regression, $\bar{R}^2$ will fall (rise) when the variable x is deleted from the regression if the square of the t ratio associated with this variable is greater (less) than 1.

We have shown that $R^2$ will never fall when a variable is added to the regression. We now consider this result more generally. The change in the residual sum of squares when a set of variables $X_2$ is added to the regression is
$$e_{1,2}'e_{1,2} = e_1'e_1 - b_2'X_2'M_1X_2b_2,$$
where we use subscript 1 to indicate the regression based on $X_1$ alone and 1, 2 to indicate the use of both $X_1$ and $X_2$. The coefficient vector $b_2$ is the coefficients on $X_2$ in the multiple regression of y on $X_1$ and $X_2$. [See (3-19) and (3-20) for definitions of $b_2$ and $M_1$.]
³ This result comes at a cost, however. The parameter estimates become progressively less precise as we do so. We will pursue this result in Chapter 4.
⁴ This measure is sometimes advocated on the basis of the unbiasedness of the two quantities in the fraction. Since the ratio is not an unbiased estimator of any population quantity, it is difficult to justify the adjustment on this basis.

Therefore,
$$R_{1,2}^2 = 1 - \frac{e_1'e_1 - b_2'X_2'M_1X_2b_2}{y'M^0y} = R_1^2 + \frac{b_2'X_2'M_1X_2b_2}{y'M^0y},$$
  • 61.
which is greater than $R_1^2$ unless $b_2$ equals zero. ($M_1X_2$ could not be zero unless $X_2$ was a linear function of $X_1$, in which case the regression on $X_1$ and $X_2$ could not be computed.) This equation can be manipulated a bit further to obtain
$$R_{1,2}^2 = R_1^2 + \frac{y'M_1y}{y'M^0y}\,\frac{b_2'X_2'M_1X_2b_2}{y'M_1y}.$$
But $y'M_1y = e_1'e_1$, so the first term in the product is $1 - R_1^2$. The second is the multiple correlation in the regression of $M_1y$ on $M_1X_2$, or the partial correlation (after the effect of $X_1$ is removed) in the regression of y on $X_2$. Collecting terms, we have
$$R_{1,2}^2 = R_1^2 + \left(1 - R_1^2\right) r_{y2\cdot 1}^2.$$
[This is the multivariate counterpart to (3-29).]

Therefore, it is possible to push $R^2$ as high as desired just by adding regressors. This possibility motivates the use of the adjusted R-squared in (3-30), instead of $R^2$, as a method of choosing among alternative models. Since $\bar{R}^2$ incorporates a penalty for reducing the degrees of freedom while still revealing an improvement in fit, one possibility is to choose the specification that maximizes $\bar{R}^2$. It has been suggested that the adjusted R-squared does not penalize the loss of degrees of freedom heavily enough.⁵ Some alternatives that have been proposed for comparing models (which we index by j) are
$$\tilde{R}_j^2 = 1 - \frac{n + K_j}{n - K_j}\left(1 - R_j^2\right),$$
which minimizes Amemiya’s (1985) prediction criterion,
$$PC_j = \frac{e_j'e_j}{n - K_j}\left(1 + \frac{K_j}{n}\right) = s_j^2\left(1 + \frac{K_j}{n}\right)$$
and the Akaike and Bayesian information criteria which are given in (8-18) and (8-19).

3.5.2 R-SQUARED AND THE CONSTANT TERM IN THE MODEL
A second difficulty with $R^2$ concerns the constant term in the model. The proof that $0 \le R^2 \le 1$ requires X to contain a column of 1s. If not, then (1) $M^0e \ne e$ and (2) $e'M^0X \ne 0$, and the term $2e'M^0Xb$ in $y'M^0y = (M^0Xb + M^0e)'(M^0Xb + M^0e)$ in the preceding expansion will not drop out. Consequently, when we compute
$$R^2 = 1 - \frac{e'e}{y'M^0y},$$
the result is unpredictable. 
It will never be higher and can be far lower than the same figure computed for the regression with a constant term included. It can even be negative. Computer packages differ in their computation of $R^2$. An alternative computation,
$$R^2 = \frac{b'X'y}{y'M^0y},$$
is equally problematic. Again, this calculation will differ from the one obtained with the constant term included; this time, $R^2$ may be larger than 1.

⁵ See, for example, Amemiya (1985, pp. 50–51).

Some computer packages
  • 62.
bypass these difficulties by reporting a third “$R^2$,” the squared sample correlation between the actual values of y and the fitted values from the regression. This approach could be deceptive. If the regression contains a constant term, then, as we have seen, all three computations give the same answer. Even if not, this last one will still produce a value between zero and one. But, it is not a proportion of variation explained. On the other hand, for the purpose of comparing models, this squared correlation might well be a useful descriptive device. It is important for users of computer packages to be aware of how the reported $R^2$ is computed. Indeed, some packages will give a warning in the results when a regression is fit without a constant or by some technique other than linear least squares.

3.5.3 COMPARING MODELS
The value of $R^2$ we obtained for the consumption function in Example 3.2 seems high in an absolute sense. Is it? Unfortunately, there is no absolute basis for comparison. In fact, in using aggregate time-series data, coefficients of determination this high are routine. In terms of the values one normally encounters in cross sections, an $R^2$ of 0.5 is relatively high. Coefficients of determination in cross sections of individual data as high as 0.2 are sometimes noteworthy. The point of this discussion is that whether a regression line provides a good fit to a body of data depends on the setting.

Little can be said about the relative quality of fits of regression lines in different contexts or in different data sets even if supposedly generated by the same data generating mechanism. One must be careful, however, even in a single context, to be sure to use the same basis for comparison for competing models. Usually, this concern is about how the dependent variable is computed. For example, a perennial question concerns whether a linear or loglinear model fits the data better. 
Unfortunately, the question cannot be answered with a direct comparison. An $R^2$ for the linear regression model is different from an $R^2$ for the loglinear model. Variation in y is different from variation in ln y. The latter $R^2$ will typically be larger, but this does not imply that the loglinear model is a better fit in some absolute sense.

It is worth emphasizing that $R^2$ is a measure of linear association between x and y. For example, the third panel of Figure 3.3 shows data that might arise from the model
$$y_i = \alpha + \beta(x_i - \gamma)^2 + \varepsilon_i.$$
(The constant γ allows x to be distributed about some value other than zero.) The relationship between y and x in this model is nonlinear, and a linear regression would find no fit.

A final word of caution is in order. The interpretation of $R^2$ as a proportion of variation explained is dependent on the use of least squares to compute the fitted values. It is always correct to write
$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + e_i$$
regardless of how $\hat{y}_i$ is computed. Thus, one might use $\hat{y}_i = \exp(\widehat{\ln y_i})$ from a loglinear model in computing the sum of squares on the two sides; however, the cross-product term vanishes only if least squares is used to compute the fitted values and if the model contains a constant term. Thus, in the suggested example, it would still be unclear whether the linear or loglinear model fits better; the cross-product term has been ignored
  • 63.
in computing $R^2$ for the loglinear model. Only in the case of least squares applied to a linear equation with a constant term can $R^2$ be interpreted as the proportion of variation in y explained by variation in x. An analogous computation can be done without computing deviations from means if the regression does not contain a constant term. Other purely algebraic artifacts will crop up in regressions without a constant, however. For example, the value of $R^2$ will change when the same constant is added to each observation on y, but it is obvious that nothing fundamental has changed in the regression relationship. One should be wary (even skeptical) in the calculation and interpretation of fit measures for regressions without constant terms.

3.6 SUMMARY AND CONCLUSIONS
This chapter has described the purely algebraic exercise of fitting a line (hyperplane) to a set of points using the method of least squares. We considered the primary problem first, using a data set of n observations on K variables. We then examined several aspects of the solution, including the nature of the projection and residual maker matrices and several useful algebraic results relating to the computation of the residuals and their sum of squares. We also examined the difference between gross or simple regression and correlation and multiple regression by defining “partial regression coefficients” and “partial correlation coefficients.” The Frisch-Waugh Theorem (3.3) is a fundamentally useful tool in regression analysis which enables us to obtain in closed form the expression for a subvector of a vector of regression coefficients. We examined several aspects of the partitioned regression, including how the fit of the regression model changes when variables are added to it or removed from it. Finally, we took a closer look at the conventional measure of how well the fitted regression line predicts or “fits” the data. 
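The three competing "$R^2$" computations discussed in Section 3.5.2 for a regression without a constant term can be compared on a small example. The sketch below uses made-up data (five hypothetical observations chosen so that the arithmetic is exact) and a single regressor with no constant.

```python
# Hypothetical data: y falls as x rises, but the fitted line is forced through the origin.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 9.0, 8.0, 7.0, 6.0]
n = len(y)

# Least squares without a constant: b = sum(x*y) / sum(x*x).
s_xy_raw = sum(xi * yi for xi, yi in zip(x, y))
b = s_xy_raw / sum(xi * xi for xi in x)
fitted = [b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)      # y'M0y
sse = sum(ei ** 2 for ei in resid)            # e'e

# (1) 1 - e'e / y'M0y: can be negative without a constant term.
r2_residual_based = 1.0 - sse / sst

# (2) b'X'y / y'M0y: can exceed 1 without a constant term.
r2_explained_based = b * s_xy_raw / sst

# (3) Squared correlation between actual and fitted values: always in [0, 1].
f_bar = sum(fitted) / n
s_yf = sum((yi - y_bar) * (fi - f_bar) for yi, fi in zip(y, fitted))
s_ff = sum((fi - f_bar) ** 2 for fi in fitted)
r2_squared_corr = s_yf ** 2 / (s_ff * sst)

print(r2_residual_based, r2_explained_based, r2_squared_corr)  # -10.0 22.0 1.0
```

In this constructed case the third measure equals 1 even though the fitted line slopes the wrong way, which illustrates why the text cautions that the squared correlation is not a proportion of variation explained.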
Key Terms and Concepts
• Adjusted R-squared
• Analysis of variance
• Bivariate regression
• Coefficient of determination
• Disturbance
• Fitting criterion
• Frisch-Waugh theorem
• Goodness of fit
• Least squares
• Least squares normal equations
• Moment matrix
• Multiple correlation
• Multiple regression
• Netting out
• Normal equations
• Orthogonal regression
• Partial correlation coefficient
• Partial regression coefficient
• Partialing out
• Partitioned regression
• Prediction criterion
• Population quantity
• Population regression
• Projection
• Projection matrix
• Residual
• Residual maker
• Total variation

Exercises
1. The Two Variable Regression. For the regression model $y = \alpha + \beta x + \varepsilon$,
a. Show that the least squares normal equations imply $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$.
b. Show that the solution for the constant term is $a = \bar{y} - b\bar{x}$.
c. Show that the solution for b is $b = \left[\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right] / \left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]$.
  • 64.
d. Prove that these two values uniquely minimize the sum of squares by showing that the diagonal elements of the second derivatives matrix of the sum of squares with respect to the parameters are both positive and that the determinant is $4n\left[\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2\right] = 4n\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]$, which is positive unless all values of x are the same.
2. Change in the sum of squares. Suppose that b is the least squares coefficient vector in the regression of y on X and that c is any other K × 1 vector. Prove that the difference in the two sums of squared residuals is
$$(y - Xc)'(y - Xc) - (y - Xb)'(y - Xb) = (c - b)'X'X(c - b).$$
Prove that this difference is positive.
3. Linear Transformations of the data. Consider the least squares regression of y on K variables (with a constant) X. Consider an alternative set of regressors Z = XP, where P is a nonsingular matrix. Thus, each column of Z is a mixture of some of the columns of X. Prove that the residual vectors in the regressions of y on X and y on Z are identical. What relevance does this have to the question of changing the fit of a regression by changing the units of measurement of the independent variables?
4. Partial Frisch and Waugh. In the least squares regression of y on a constant and X, to compute the regression coefficients on X, we can first transform y to deviations from the mean $\bar{y}$ and, likewise, transform each column of X to deviations from the respective column mean; second, regress the transformed y on the transformed X without a constant. Do we get the same result if we only transform y? What if we only transform X?
5. Residual makers. What is the result of the matrix product $M_1M$ where $M_1$ is defined in (3-19) and M is defined in (3-14)?
6. Adding an observation. A data set consists of n observations on $X_n$ and $y_n$. The least squares estimator based on these n observations is $b_n = (X_n'X_n)^{-1}X_n'y_n$. 
Another observation, $x_s$ and $y_s$, becomes available. Prove that the least squares estimator computed using this additional observation is
$$b_{n,s} = b_n + \frac{1}{1 + x_s'(X_n'X_n)^{-1}x_s}\,(X_n'X_n)^{-1}x_s\,(y_s - x_s'b_n).$$
Note that the last term is $e_s$, the residual from the prediction of $y_s$ using the coefficients based on $X_n$ and $b_n$. Conclude that the new data change the results of least squares only if the new observation on y cannot be perfectly predicted using the information already in hand.
7. Deleting an observation. A common strategy for handling a case in which an observation is missing data for one or more variables is to fill those missing variables with 0s and add a variable to the model that takes the value 1 for that one observation and 0 for all other observations. Show that this ‘strategy’ is equivalent to discarding the observation as regards the computation of b but it does have an effect on $R^2$. Consider the special case in which X contains only a constant and one variable. Show that replacing missing values of x with the mean of the complete observations has the same effect as adding the new variable.
8. Demand system estimation. Let Y denote total expenditure on consumer durables, nondurables, and services and $E_d$, $E_n$, and $E_s$ are the expenditures on the three
  • 65.
categories. As defined, $Y = E_d + E_n + E_s$. Now, consider the expenditure system
$$E_d = \alpha_d + \beta_d Y + \gamma_{dd}P_d + \gamma_{dn}P_n + \gamma_{ds}P_s + \varepsilon_d,$$
$$E_n = \alpha_n + \beta_n Y + \gamma_{nd}P_d + \gamma_{nn}P_n + \gamma_{ns}P_s + \varepsilon_n,$$
$$E_s = \alpha_s + \beta_s Y + \gamma_{sd}P_d + \gamma_{sn}P_n + \gamma_{ss}P_s + \varepsilon_s.$$
Prove that if all equations are estimated by ordinary least squares, then the sum of the expenditure coefficients will be 1 and the four other column sums in the preceding model will be zero.
9. Change in adjusted $R^2$. Prove that the adjusted $R^2$ in (3-30) rises (falls) when variable $x_k$ is deleted from the regression if the square of the t ratio on $x_k$ in the multiple regression is less (greater) than 1.
10. Regression without a constant. Suppose that you estimate a multiple regression first with then without a constant. Whether the $R^2$ is higher in the second case than the first will depend in part on how it is computed. Using the (relatively) standard method $R^2 = 1 - (e'e/y'M^0y)$, which regression will have a higher $R^2$?
11. Three variables, N, D, and Y, all have zero means and unit variances. A fourth variable is C = N + D. In the regression of C on Y, the slope is 0.8. In the regression of C on N, the slope is 0.5. In the regression of D on Y, the slope is 0.4. What is the sum of squared residuals in the regression of C on D? There are 21 observations and all moments are computed using 1/(n − 1) as the divisor.
12. Using the matrices of sums of squares and cross products immediately preceding Section 3.2.3, compute the coefficients in the multiple regression of real investment on a constant, real GNP and the interest rate. Compute $R^2$.
13. In the December, 1969, American Economic Review (pp. 
886–896), Nathaniel Leff reports the following least squares regression results for a cross section study of the effect of age composition on savings in 74 countries in 1964:
ln S/Y = 7.3439 + 0.1596 ln Y/N + 0.0254 ln G − 1.3520 ln D1 − 0.3990 ln D2
ln S/N = 8.7851 + 1.1486 ln Y/N + 0.0265 ln G − 1.3438 ln D1 − 0.3966 ln D2
where S/Y = domestic savings ratio, S/N = per capita savings, Y/N = per capita income, D1 = percentage of the population under 15, D2 = percentage of the population over 64, and G = growth rate of per capita income. Are these results correct? Explain.
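The updating formula stated in Exercise 6 can be checked numerically before attempting the proof. The sketch below is only a numerical check under made-up data, not the derivation the exercise asks for; it uses a constant and one regressor so that $(X_n'X_n)^{-1}$ is a 2 × 2 inverse, and the helper name `ols_constant_and_x` is hypothetical.

```python
def ols_constant_and_x(xs, ys):
    """Return (b, inv) where b = [intercept, slope] and inv = (X'X)^{-1}
    for a regression of y on a constant and a single regressor x."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(v * v for v in xs)
    sy = sum(ys)
    sxy = sum(xv * yv for xv, yv in zip(xs, ys))
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    b = [inv[0][0] * sy + inv[0][1] * sxy,
         inv[1][0] * sy + inv[1][1] * sxy]
    return b, inv

# Hypothetical original sample and one additional observation.
x_n = [1.0, 2.0, 3.0, 4.0, 5.0]
y_n = [2.1, 3.9, 6.2, 7.8, 10.1]
x_new, y_new = 6.0, 11.0

b_n, inv_n = ols_constant_and_x(x_n, y_n)

# Updating formula: b_ns = b_n + (X'X)^{-1} x_s (y_s - x_s'b_n) / (1 + x_s'(X'X)^{-1} x_s).
x_s = [1.0, x_new]                                        # new row of X, including the constant
v = [inv_n[0][0] * x_s[0] + inv_n[0][1] * x_s[1],
     inv_n[1][0] * x_s[0] + inv_n[1][1] * x_s[1]]         # (X'X)^{-1} x_s
h = x_s[0] * v[0] + x_s[1] * v[1]                         # x_s'(X'X)^{-1} x_s
e_s = y_new - (x_s[0] * b_n[0] + x_s[1] * b_n[1])         # prediction residual
b_updated = [b_n[0] + v[0] * e_s / (1.0 + h),
             b_n[1] + v[1] * e_s / (1.0 + h)]

# Direct recomputation on all n + 1 observations, for comparison.
b_direct, _ = ols_constant_and_x(x_n + [x_new], y_n + [y_new])
print(b_updated, b_direct)
```

If `e_s` were zero, `b_updated` would equal `b_n`, which is the conclusion the exercise asks for.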
  • 66.
4 FINITE-SAMPLE PROPERTIES OF THE LEAST SQUARES ESTIMATOR

4.1 INTRODUCTION
Chapter 3 treated fitting the linear regression to the data as a purely algebraic exercise. We will now examine the least squares estimator from a statistical viewpoint. This chapter will consider exact, finite-sample results such as unbiased estimation and the precise distributions of certain test statistics. Some of these results require fairly strong assumptions, such as nonstochastic regressors or normally distributed disturbances. In the next chapter, we will turn to the properties of the least squares estimator in more general cases. In these settings, we rely on approximations that do not hold as exact results but which do improve as the sample size increases.

There are other candidates for estimating β. In a two-variable case, for example, we might use the intercept, a, and slope, b, of the line between the points with the largest and smallest values of x. Alternatively, we might find the a and b that minimize the sum of absolute values of the residuals. The question of which estimator to choose is usually based on the statistical properties of the candidates, such as unbiasedness, efficiency, and precision. These, in turn, frequently depend on the particular distribution that we assume produced the data. However, a number of desirable properties can be obtained for the least squares estimator even without specifying a particular distribution for the disturbances in the regression.

In this chapter, we will examine in detail the least squares as an estimator of the model parameters of the classical model (defined in the following Table 4.1). We begin in Section 4.2 by returning to the question raised but not answered in Footnote 1, Chapter 3, that is, why least squares? We will then analyze the estimator in detail. 
We take Assumption A1, linearity of the model as given, though in Section 4.2, we will consider briefly the possibility of a different predictor for y. Assumption A2, the identification condition that the data matrix have full rank is considered in Section 4.9 where data complications that arise in practice are discussed. The near failure of this assumption is a recurrent problem in “real world” data. Section 4.3 is concerned with unbiased estimation. Assumption A3, that the disturbances and the independent variables are uncorrelated, is a pivotal result in this discussion. Assumption A4, homoscedasticity and nonautocorrelation of the disturbances, in contrast to A3, only has relevance to whether least squares is an optimal use of the data. As noted, there are alternative estimators available, but with Assumption A4, the least squares estimator is usually going to be preferable. Sections 4.4 and 4.5 present several statistical results for the least squares estimator that depend crucially on this assumption. The assumption that the data in X are nonstochastic, known constants, has some implications for how certain derivations
  • 67.
proceed, but in practical terms, is a minor consideration. Indeed, nearly all that we do with the regression model departs from this assumption fairly quickly. It serves only as a useful departure point. The issue is considered in Section 4.5. Finally, the normality of the disturbances assumed in A6 is crucial in obtaining the sampling distributions of several useful statistics that are used in the analysis of the linear model. We note that in the course of our analysis of the linear model as we proceed through Chapter 9, all six of these assumptions will be discarded.

TABLE 4.1 Assumptions of the Classical Linear Regression Model
A1. Linearity: $y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i$.
A2. Full rank: The n × K sample data matrix, X has full column rank.
A3. Exogeneity of the independent variables: $E[\varepsilon_i \mid x_{j1}, x_{j2}, \ldots, x_{jK}] = 0$, i, j = 1, ..., n. There is no correlation between the disturbances and the independent variables.
A4. Homoscedasticity and nonautocorrelation: Each disturbance, $\varepsilon_i$ has the same finite variance, $\sigma^2$ and is uncorrelated with every other disturbance, $\varepsilon_j$.
A5. Exogenously generated data $(x_{i1}, x_{i2}, \ldots, x_{iK})$, i = 1, ..., n.
A6. Normal distribution: The disturbances are normally distributed.

4.2 MOTIVATING LEAST SQUARES
Ease of computation is one reason that least squares is so popular. However, there are several other justifications for this technique. First, least squares is a natural approach to estimation, which makes explicit use of the structure of the model as laid out in the assumptions. Second, even if the true model is not a linear regression, the regression line fit by least squares is an optimal linear predictor for the dependent variable. Thus, it enjoys a sort of robustness that other estimators do not. 
Finally, under the very specific assumptions of the classical model, by one reasonable criterion, least squares will be the most efficient use of the data. We will consider each of these in turn.

4.2.1 THE POPULATION ORTHOGONALITY CONDITIONS
Let x denote the vector of independent variables in the population regression model and for the moment, based on assumption A5, the data may be stochastic or nonstochastic. Assumption A3 states that the disturbances in the population are stochastically orthogonal to the independent variables in the model; that is, $E[\varepsilon \mid x] = 0$. It follows that $\text{Cov}[x, \varepsilon] = 0$. Since (by the law of iterated expectations, Theorem B.1) $E_x\{E[\varepsilon \mid x]\} = E[\varepsilon] = 0$, we may write this as
$$E_x E_\varepsilon[x\varepsilon] = E_x E_y[x(y - x'\beta)] = 0$$
or
$$E_x E_y[xy] = E_x[xx']\beta. \tag{4-1}$$
(The right-hand side is not a function of y so the expectation is taken only over x.) Now, recall the least squares normal equations, $X'y = X'Xb$. Divide this by n and write it as
  • 68.
a summation to obtain
$$\left(\frac{1}{n}\sum_{i=1}^{n} x_i y_i\right) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right) b. \tag{4-2}$$
Equation (4-1) is a population relationship. Equation (4-2) is a sample analog. Assuming the conditions underlying the laws of large numbers presented in Appendix D are met, the sums on the left hand and right hand sides of (4-2) are estimators of their counterparts in (4-1). Thus, by using least squares, we are mimicking in the sample the relationship in the population. We’ll return to this approach to estimation in Chapters 10 and 18 under the subject of GMM estimation.

4.2.2 MINIMUM MEAN SQUARED ERROR PREDICTOR
As an alternative approach, consider the problem of finding an optimal linear predictor for y. Once again, ignore Assumption A6 and, in addition, drop Assumption A1 that the conditional mean function, $E[y \mid x]$ is linear. For the criterion, we will use the mean squared error rule, so we seek the minimum mean squared error linear predictor of y, which we’ll denote $x'\gamma$. The expected squared error of this predictor is
$$\text{MSE} = E_y E_x [y - x'\gamma]^2.$$
This can be written as
$$\text{MSE} = E_{y,x}\left\{y - E[y \mid x]\right\}^2 + E_{y,x}\left\{E[y \mid x] - x'\gamma\right\}^2.$$
We seek the γ that minimizes this expectation. The first term is not a function of γ, so only the second term needs to be minimized. Note that this term is not a function of y, so the outer expectation is actually superfluous. But, we will need it shortly, so we will carry it for the present. The necessary condition is
$$\frac{\partial E_y E_x\left\{[E(y \mid x) - x'\gamma]^2\right\}}{\partial\gamma} = E_y E_x\left\{\frac{\partial [E(y \mid x) - x'\gamma]^2}{\partial\gamma}\right\} = -2E_y E_x\, x[E(y \mid x) - x'\gamma] = 0.$$
Note that we have interchanged the operations of expectation and differentiation in the middle step, since the range of integration is not a function of γ. Finally, we have the equivalent condition
$$E_y E_x[xE(y \mid x)] = E_y E_x[xx']\gamma.$$
The left hand side of this result is $E_x E_y[xE(y \mid x)] = \text{Cov}[x, E(y \mid x)] + E[x]E_x[E(y \mid x)] = \text{Cov}[x, y] + E[x]E[y] = E_x E_y[xy]$. (We have used theorem B.2.) Therefore, the necessary condition for finding the minimum MSE predictor is
$$E_x E_y[xy] = E_x E_y[xx']\gamma. \tag{4-3}$$
This is the same as (4-1), which takes us to the least squares condition once again. Assuming that these expectations exist, they would be estimated by the sums in (4-2), which means that regardless of the form of the conditional mean, least squares is an estimator of the coefficients of the minimum expected mean squared error linear predictor. We have yet to establish the conditions necessary for the if part of the
  • 69.
theorem, but this is an opportune time to make it explicit:

THEOREM 4.1 Minimum Mean Squared Error Predictor
If the data generating mechanism generating $(x_i, y_i)$, i = 1, ..., n, is such that the law of large numbers applies to the estimators in (4-2) of the matrices in (4-1), then the minimum expected squared error linear predictor of $y_i$ is estimated by the least squares regression line.

4.2.3 MINIMUM VARIANCE LINEAR UNBIASED ESTIMATION
Finally, consider the problem of finding a linear unbiased estimator. If we seek the one which has smallest variance, we will be led once again to least squares. This proposition will be proved in Section 4.4.

The preceding does not assert that no other competing estimator would ever be preferable to least squares. We have restricted attention to linear estimators. The result immediately above precludes what might be an acceptably biased estimator. And, of course, the assumptions of the model might themselves not be valid. Although A5 and A6 are ultimately of minor consequence, the failure of any of the first four assumptions would make least squares much less attractive than we have suggested here.

4.3 UNBIASED ESTIMATION
The least squares estimator is unbiased in every sample. To show this, write
$$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon. \tag{4-4}$$
Now, take expectations, iterating over X;
$$E[b \mid X] = \beta + E[(X'X)^{-1}X'\varepsilon \mid X].$$
By Assumption A3, the second term is 0, so
$$E[b \mid X] = \beta.$$
Therefore,
$$E[b] = E_X\left\{E[b \mid X]\right\} = E_X[\beta] = \beta.$$
The interpretation of this result is that for any particular set of observations, X, the least squares estimator has expectation β. Therefore, when we average this over the possible values of X we find the unconditional mean is β as well. 
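The conditional result $E[b \mid X] = \beta$ can be illustrated by holding X fixed and redrawing only the disturbances. The sketch below is a minimal simulation under assumed settings: the seed, the sample size of 50, and the 2,000 replications are arbitrary choices, while the intercept, slope, and disturbance scale follow the values used in Example 4.1.

```python
import random

random.seed(12345)

beta0, beta1 = 0.5, 0.5       # "true" intercept and slope
sigma = 0.5                   # standard deviation of the disturbance
n, replications = 50, 2000    # arbitrary sample size and number of redraws

# Draw the regressor once and hold it fixed, so every replication conditions on the same X.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

slopes = []
for _ in range(replications):
    eps = [random.gauss(0.0, sigma) for _ in range(n)]
    y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]
    # Least squares slope in the two-variable model.
    b = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / s_xx
    slopes.append(b)

mean_slope = sum(slopes) / replications
print(f"average slope over {replications} replications: {mean_slope:.4f}")
```

Individual slopes scatter around 0.5, but their average should sit close to it, which is what unbiasedness describes.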
Example 4.1 The Sampling Distribution of a Least Squares Estimator
The following sampling experiment, which can be replicated in any computer program that provides a random number generator and a means of drawing a random sample of observations from a master data set, shows the nature of a sampling distribution and the implication of unbiasedness. We drew two samples of 10,000 random draws on $w_i$ and $x_i$ from the standard
  • 70.
normal distribution (mean zero, variance 1). We then generated a set of $\varepsilon_i$s equal to $0.5w_i$ and $y_i = 0.5 + 0.5x_i + \varepsilon_i$. We take this to be our population. We then drew 500 random samples of 100 observations from this population, and with each one, computed the least squares slope (using at replication r, $b_r = \left[\sum_{j=1}^{100}(x_{jr} - \bar{x}_r)y_{jr}\right] / \left[\sum_{j=1}^{100}(x_{jr} - \bar{x}_r)^2\right]$). The histogram in Figure 4.1 shows the result of the experiment. Note that the distribution of slopes has a mean roughly equal to the “true value” of 0.5, and it has a substantial variance, reflecting the fact that the regression slope, like any other statistic computed from the sample, is a random variable. The concept of unbiasedness relates to the central tendency of this distribution of values obtained in repeated sampling from the population.

[FIGURE 4.1 Histogram for Sampled Least Squares Regression Slopes. Horizontal axis: b, from .338 to .688; vertical axis: Frequency, from 0 to 100.]

4.4 THE VARIANCE OF THE LEAST SQUARES ESTIMATOR AND THE GAUSS MARKOV THEOREM
If the regressors can be treated as nonstochastic, as they would be in an experimental situation in which the analyst chooses the values in X, then the sampling variance of the least squares estimator can be derived by treating X as a matrix of constants. Alternatively, we can allow X to be stochastic, do the analysis conditionally on the observed X, then consider averaging over X as we did in the preceding section. Using (4-4) again, we have
$$b = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon. \tag{4-5}$$
Since we can write $b = \beta + A\varepsilon$, where A is $(X'X)^{-1}X'$, b is a linear function of the disturbances, which by the definition we will use makes it a linear estimator. As we have
  • 71.
seen, the expected value of the second term in (4-5) is 0. Therefore, regardless of the distribution of ε, under our other assumptions, b is a linear, unbiased estimator of β. The covariance matrix of the least squares slope estimator is
$$\begin{aligned}
\text{Var}[b \mid X] &= E[(b - \beta)(b - \beta)' \mid X] \\
&= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \mid X] \\
&= (X'X)^{-1}X'E[\varepsilon\varepsilon' \mid X]X(X'X)^{-1} \\
&= (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}.
\end{aligned}$$

Example 4.2 Sampling Variance in the Two-Variable Regression Model
Suppose that X contains only a constant term (column of 1s) and a single regressor x. The lower right element of $\sigma^2(X'X)^{-1}$ is
$$\text{Var}[b \mid x] = \text{Var}[b - \beta \mid x] = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
Note, in particular, the denominator of the variance of b. The greater the variation in x, the smaller this variance. For example, consider the problem of estimating the slopes of the two regressions in Figure 4.2. A more precise result will be obtained for the data in the right-hand panel of the figure.

[FIGURE 4.2 Effect of Increased Variation in x Given the Same Conditional and Overall Variation in y. Two panels, each plotting y against x.]

We will now obtain a general result for the class of linear unbiased estimators of β. Let $b_0 = Cy$ be another linear unbiased estimator of β, where C is a K × n matrix. If $b_0$ is unbiased, then
$$E[Cy \mid X] = E[(CX\beta + C\varepsilon) \mid X] = \beta,$$
which implies that CX = I. There are many candidates. For example, consider using just the first K (or, any K) linearly independent rows of X. Then $C = [X_0^{-1} : 0]$, where $X_0^{-1}$
  • 72.
is the inverse of the matrix formed from the K rows of X. The covariance matrix of $b_0$ can be found by replacing $(X'X)^{-1}X'$ with C in (4-5); the result is $\text{Var}[b_0 \mid X] = \sigma^2 CC'$. Now let $D = C - (X'X)^{-1}X'$ so $Dy = b_0 - b$. Then,
$$\text{Var}[b_0 \mid X] = \sigma^2\left[(D + (X'X)^{-1}X')(D + (X'X)^{-1}X')'\right].$$
We know that $CX = I = DX + (X'X)^{-1}(X'X)$, so DX must equal 0. Therefore,
$$\text{Var}[b_0 \mid X] = \sigma^2(X'X)^{-1} + \sigma^2 DD' = \text{Var}[b \mid X] + \sigma^2 DD'.$$
Since a quadratic form in $DD'$ is $q'DD'q = z'z \ge 0$, the conditional covariance matrix of $b_0$ equals that of b plus a nonnegative definite matrix. Therefore, every quadratic form in $\text{Var}[b_0 \mid X]$ is at least as large as the corresponding quadratic form in $\text{Var}[b \mid X]$, which implies a very important property of the least squares coefficient vector.

THEOREM 4.2 Gauss–Markov Theorem
In the classical linear regression model with regressor matrix X, the least squares estimator b is the minimum variance linear unbiased estimator of β. For any vector of constants w, the minimum variance linear unbiased estimator of $w'\beta$ in the classical regression model is $w'b$, where b is the least squares estimator.

The proof of the second statement follows from the previous derivation, since the variance of $w'b$ is a quadratic form in $\text{Var}[b \mid X]$, and likewise for any $b_0$, and proves that each individual slope estimator $b_k$ is the best linear unbiased estimator of $\beta_k$. (Let w be all zeros except for a one in the kth position.) The theorem is much broader than this, however, since the result also applies to every other linear combination of the elements of β.

4.5 THE IMPLICATIONS OF STOCHASTIC REGRESSORS
The preceding analysis is done conditionally on the observed data. 
A convenient method of obtaining the unconditional statistical properties of $b$ is to obtain the desired results conditioned on $X$ first, then find the unconditional result by “averaging” (e.g., by integrating over) the conditional distributions. The crux of the argument is that if we can establish unbiasedness conditionally on an arbitrary $X$, then we can average over $X$’s to obtain an unconditional result. We have already used this approach to show the unconditional unbiasedness of $b$ in Section 4.3, so we now turn to the conditional variance. The conditional variance of $b$ is $\operatorname{Var}[b \mid X] = \sigma^2 (X'X)^{-1}$.
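As a numerical aside, the conditional covariance matrix just stated and the Gauss–Markov comparison of Section 4.4 can be checked on simulated data. The sketch below is only an illustration under stated assumptions: the sample size, the seed, the value of sigma2, and all variable names are hypothetical choices, not taken from the text. It builds the alternative estimator from the first K rows of X and confirms that its conditional variance exceeds that of least squares by a nonnegative definite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
n, K = 50, 2                     # hypothetical sample size and number of columns
sigma2 = 4.0                     # hypothetical disturbance variance

# X holds a constant and a single regressor, as in Example 4.2.
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
var_b = sigma2 * XtX_inv         # Var[b | X] = sigma^2 (X'X)^{-1}

# Example 4.2: the lower right element equals sigma^2 / sum (x_i - xbar)^2.
slope_var = sigma2 / np.sum((x - x.mean()) ** 2)

# Alternative linear unbiased estimator b0 = C y with C = [X0^{-1} : 0],
# where X0 is the K x K matrix formed from the first K rows of X.
X0 = X[:K, :]
C = np.hstack([np.linalg.inv(X0), np.zeros((K, n - K))])
var_b0 = sigma2 * C @ C.T        # Var[b0 | X] = sigma^2 C C'

# D = C - (X'X)^{-1} X' satisfies DX = 0, so the variances differ by sigma^2 D D'.
D = C - XtX_inv @ X.T
gap = var_b0 - var_b
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()

print("lower right element matches Example 4.2:", np.isclose(var_b[1, 1], slope_var))
print("CX = I:", np.allclose(C @ X, np.eye(K)), " DX = 0:", np.allclose(D @ X, 0))
print("gap = sigma^2 DD':", np.allclose(gap, sigma2 * D @ D.T))
print("gap is nonnegative definite:", min_eig >= -1e-8)
```

Because b0 uses only K of the n observations, the gap is typically large here; any other C with CX = I could be substituted and the same checks should hold.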
  • 73. For the exact variance, we use the decomposition of variance of (B-70):
$$
\operatorname{Var}[b] = E_X[\operatorname{Var}[b \mid X]] + \operatorname{Var}_X[E[b \mid X]].
$$
The second term is zero since $E[b \mid X] = \beta$ for all $X$, so
$$
\operatorname{Var}[b] = E_X[\sigma^2 (X'X)^{-1}] = \sigma^2 E_X[(X'X)^{-1}].
$$
Our earlier conclusion is altered slightly. We must replace $(X'X)^{-1}$ with its expected value to get the appropriate covariance matrix, which brings a subtle change in the interpretation of these results. The unconditional variance of $b$ can only be described in terms of the average behavior of $X$, so to proceed further, it would be necessary to make some assumptions about the variances and covariances of the regressors. We will return to this subject in Chapter 5.
We showed in Section 4.4 that $\operatorname{Var}[b \mid X] \le \operatorname{Var}[b_0 \mid X]$ for any $b_0 \ne b$ and for the specific $X$ in our sample. But if this inequality holds for every particular $X$, then it must hold for $\operatorname{Var}[b] = E_X[\operatorname{Var}[b \mid X]]$. That is, if it holds for every particular $X$, then it must hold over the average value(s) of $X$. The conclusion, therefore, is that the important results we have obtained thus far for the least squares estimator, unbiasedness and the Gauss–Markov theorem, hold whether or not we regard $X$ as stochastic.
THEOREM 4.3 Gauss–Markov Theorem (Concluded)
In the classical linear regression model, the least squares estimator $b$ is the minimum variance linear unbiased estimator of $\beta$ whether $X$ is stochastic or nonstochastic, so long as the other assumptions of the model continue to hold.
4.6 ESTIMATING THE VARIANCE OF THE LEAST SQUARES ESTIMATOR
If we wish to test hypotheses about $\beta$ or to form confidence intervals, then we will require a sample estimate of the covariance matrix $\operatorname{Var}[b \mid X] = \sigma^2 (X'X)^{-1}$. The population parameter $\sigma^2$ remains to be estimated. Since $\sigma^2$ is the expected value of $\varepsilon_i^2$ and $e_i$ is an estimate of $\varepsilon_i$, by analogy,
$$
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2
$$
  • 74. would seem to be a natural estimator. But the least squares residuals are imperfect estimates of their population counterparts; $e_i = y_i - x_i'b = \varepsilon_i - x_i'(b - \beta)$. The estimator is distorted (as might be expected) because $\beta$ is not observed directly. The expected square on the right-hand side involves a second term that might not have expected value zero.
The least squares residuals are
$$
e = My = M[X\beta + \varepsilon] = M\varepsilon,
$$
as $MX = 0$. [See (3-15).] An estimator of $\sigma^2$ will be based on the sum of squared residuals:
$$
e'e = \varepsilon' M \varepsilon. \tag{4-6}
$$
The expected value of this quadratic form is
$$
E[e'e \mid X] = E[\varepsilon' M \varepsilon \mid X].
$$
The scalar $\varepsilon' M \varepsilon$ is a $1 \times 1$ matrix, so it is equal to its trace. By using the result on cyclic permutations (A-94),
$$
E[\operatorname{tr}(\varepsilon' M \varepsilon) \mid X] = E[\operatorname{tr}(M \varepsilon\varepsilon') \mid X].
$$
Since $M$ is a function of $X$, the result is
$$
\operatorname{tr}\left(M\,E[\varepsilon\varepsilon' \mid X]\right) = \operatorname{tr}(M \sigma^2 I) = \sigma^2 \operatorname{tr}(M).
$$
The trace of $M$ is
$$
\operatorname{tr}[I_n - X(X'X)^{-1}X'] = \operatorname{tr}(I_n) - \operatorname{tr}[(X'X)^{-1}X'X] = \operatorname{tr}(I_n) - \operatorname{tr}(I_K) = n - K.
$$
Therefore,
$$
E[e'e \mid X] = (n - K)\sigma^2,
$$
so the natural estimator is biased toward zero, although the bias becomes smaller as the sample size increases. An unbiased estimator of $\sigma^2$ is
$$
s^2 = \frac{e'e}{n - K}. \tag{4-7}
$$
The estimator is unbiased unconditionally as well, since $E[s^2] = E_X\{E[s^2 \mid X]\} = E_X[\sigma^2] = \sigma^2$. The standard error of the regression is $s$, the square root of $s^2$. With $s^2$, we can then compute
$$
\text{Est. Var}[b \mid X] = s^2 (X'X)^{-1}.
$$
Henceforth, we shall use the notation Est. Var[·] to indicate a sample estimate of the sampling variance of an estimator. The square root of the $k$th diagonal element of this matrix, $\{[s^2 (X'X)^{-1}]_{kk}\}^{1/2}$, is the standard error of the estimator $b_k$, which is often denoted simply “the standard error of $b_k$.”
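The results of this section lend themselves to a small simulation. The following is a minimal sketch, not a definitive implementation: the helper `ols`, the parameter values, the seed, and the number of replications are all assumptions made for illustration. It checks that tr(M) = n − K and e = Mε for one fixed X, computes s² and the standard errors of (4-7), and then redraws both X and ε many times to compare the average of e′e/n with the average of s².

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed (assumption)
n, K = 20, 3                            # hypothetical sample size and number of regressors
sigma2 = 2.0                            # hypothetical true disturbance variance
beta = np.array([1.0, 0.5, -0.25])      # hypothetical coefficient vector


def ols(X, y):
    """Return b, residuals e, s^2 = e'e/(n-K), and the standard errors of b."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (X.shape[0] - X.shape[1])
    se = np.sqrt(np.diag(s2 * XtX_inv))  # square roots of diag of Est. Var[b | X]
    return b, e, s2, se


def draw_X():
    """Draw a regressor matrix with a constant and K-1 standard normal columns."""
    return np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])


# One fixed X: the residual maker M has trace n - K, and e = M eps.
X = draw_X()
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
trace_M = np.trace(M)
eps_fixed = rng.normal(scale=np.sqrt(sigma2), size=n)
b_fixed, e_fixed, s2_fixed, se_fixed = ols(X, X @ beta + eps_fixed)

# Monte Carlo with stochastic X: compare e'e/n with s^2 on average.
reps = 10_000
naive = np.empty(reps)
s2_draws = np.empty(reps)
for r in range(reps):
    Xr = draw_X()
    y = Xr @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    _, e, s2, _ = ols(Xr, y)
    naive[r] = e @ e / n
    s2_draws[r] = s2

print("tr(M):", trace_M, " n - K:", n - K)
print("e equals M eps:", np.allclose(e_fixed, M @ eps_fixed))
print("standard errors from one sample:", se_fixed)
print("mean of e'e/n:", naive.mean(), " theory:", sigma2 * (n - K) / n)
print("mean of s^2:  ", s2_draws.mean(), " theory:", sigma2)
```

With a small n relative to K, the downward bias of e′e/n is easy to see; increasing n in the sketch should shrink the gap between the two averages, in line with the remark that the bias becomes smaller as the sample size increases.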
  • 75. 4.7 THE NORMALITY ASSUMPTION AND BASIC STATISTICAL INFERENCE
To this point, our specification and analysis of the regression model is semiparametric (see Section 16.3). We have not used Assumption A6 (see Table 4.1), normality of $\varepsilon$, in any of our results. The assumption is useful for constructing statistics for testing hypotheses. In (4-5), $b$ is a linear function of the disturbance vector $\varepsilon$. If we assume that $\varepsilon$ has a multivariate normal distribution, then we may use the results of Section B.10.2 and the mean vector and covariance matrix derived earlier to state that
$$
b \mid X \sim N[\beta, \sigma^2 (X'X)^{-1}]. \tag{4-8}
$$
This spec