# Choosing the Right Regressors

Conventional econometric methodology is seriously deficient because it gives little importance to the choice of regressors. Omitting an important regressor leads to nonsense regressions. These slides explain the problem and provide a remedy.

Published in: Data & Analytics

### Choosing the Right Regressors

1. 1. Econometric Methodology: Choosing the Right Regressors. PIDE Nurturing Minds Seminar, 13th Sept 2017. Asad Zaman, VC PIDE.
2. 2. Major Misunderstandings of Econometric Methodology. Three papers submitted to PDR: (1) GDP = a + b1 X1 + … + bn Xn + c IPR, where IPR = Intellectual Property Rights; (2) GDP = a + b1 X1 + … + bn Xn + c EXPORTS (Export-Led Growth); (3) GDP = a + b1 X1 + … + bn Xn + c1 FDI + c2 Lit (FDI + Human Capital). Question: Can all three papers be right?
3. 3. Axiom of Correct Specification (Edward Leamer). All INCLUDED regressors MUST BE determinants. All EXCLUDED regressors MUST NOT BE determinants. Both inclusions and exclusions MUST be correctly specified for the model to be valid. ALL THREE REGRESSIONS CANNOT BE VALID: if IPR is a determinant then the other two are wrong, and similarly for the other two. SO at MOST one of the three can be correct. BUT is any one of the three correct? There are more than 60 variables AVAILABLE in the WDI data sets. So if there is ONLY one determinant, we have 60 possible regressions. HOW CAN WE FIND OUT WHICH ONE IS CORRECT?
4. 4. R-squared, AIC, BIC, Schwarz etc. Use Model Selection Criteria: this method does not work very well. KEY PROBLEM: HOW TO ENSURE THAT THERE ARE NO MISSING SIGNIFICANT VARIABLES. These cause EXTREME BIAS and WRONG AND MISLEADING RESULTS. INCLUDING EXTRA VARIABLES CAUSES LOSS OF EFFICIENCY, BUT DOES NOT CAUSE INFERENCE FAILURE, BECAUSE a 0 coefficient is a possible value. The BIG MODEL MUST ENCOMPASS the TRUE MODEL if it is to produce VALID INFERENCE.
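
The asymmetry on this slide matches the standard omitted-variable bias result. As a sketch in notation that does not appear on the slides (the symbols $X$, $Z$, $\beta$, $\gamma$ are assumptions introduced here), let the true model contain both the included regressors $X$ and the omitted regressors $Z$:

$$
y = X\beta + Z\gamma + \varepsilon, \qquad
\hat\beta_{\text{short}} = (X'X)^{-1}X'y
\;\;\Longrightarrow\;\;
E\big[\hat\beta_{\text{short}} \mid X, Z\big] = \beta + (X'X)^{-1}X'Z\,\gamma .
$$

The extra term vanishes only if $\gamma = 0$ or $X'Z = 0$. In the reverse case, where $Z$ is included although $\gamma = 0$, the estimator of $\beta$ stays unbiased and only its variance increases.
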
5. 5. How Much Trouble Can a Missing Variable Cause? CONS = Final consumption expenditure, etc. (constant 2010 US\$), Pakistan. GDP = GDP (constant 2010 US\$), Pakistan. CONS(Pak) = 4.12 + 0.883 × GDP(Pak) + ε (R² = 0.998), with (0.51), (0.006), (2.56) printed under the three right-hand terms in turn. Standard Keynesian Consumption Function. Likely to have autocorrelation and other MIS-SPECIFIED SHORT-TERM DYNAMICS, and other small missing variables. Super-consistency holds: because of the strong trends in CONS and GDP, these misspecifications will not matter. Bias due to omitted stationary variables will VANISH. But an omitted MAJOR variable can make a big difference.
6. 6. What happens if we regress CONS(Pak) on randomly chosen WDI variables? SUR = Survival to age 65, female (% of cohort). CO2 = CO2 emissions from gaseous fuel consumption (% of total). Obviously, these variables have no relation to consumption. Nonetheless, the OLS regression yields the following results: (E) CONS = -268.7 + 6.78 SUR - 1.82 CO2 + ε (R² = 0.84), with (25.9), (0.73), (0.65), (20.0) printed under the four right-hand terms in turn. Both SUR and CO2 are HIGHLY SIGNIFICANT DETERMINANTS OF CONSUMPTION! This is called a NONSENSE REGRESSION. IRRELEVANT VARIABLES BECOME SIGNIFICANT AS PROXIES FOR MISSING VARIABLES.
7. 7. ADD the relevant regressor GDP: (F) CONS = 15.64 + 0.902 GDP - 2.60 SUR + 67.8 CO2 + ε (R² = 0.99), with (0.54), (0.014), (1.40), (80.3), (2.28) printed under the five right-hand terms in turn. Both SUR and CO2 become insignificant. Nonsense regressions are caused by MISSING DETERMINANTS. WRONG DIAGNOSIS MADE BY ECONOMETRICIANS: nonsense regressions are caused by NON-STATIONARITY. The immense literature on Integration and Co-Integration is COMPLETELY USELESS. Discovery due to Atiqur Rahman, who is supervising a thesis on this theme.
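
The mechanism on slides 6 and 7 can be reproduced with simulated data. The following is a minimal sketch, not the regressions from the slides: the variable names (`driver`, `proxy`, `outcome`), the sample size, the coefficients and the small `ols` helper are all assumptions made for illustration. Every series is stationary noise, so any significance of `proxy` in the short regression comes from the omitted `driver` alone.

```python
import math
import random


def invert(m):
    """Gauss-Jordan inverse (partial pivoting) of a small square matrix."""
    k = len(m)
    aug = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(m)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def ols(y, columns):
    """OLS of y on an intercept plus `columns`; returns (coefs, std_errors, rss)."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    s2 = rss / (n - k)
    se = [math.sqrt(s2 * inv[a][a]) for a in range(k)]
    return beta, se, rss


rng = random.Random(0)  # fixed seed, hypothetical data
n = 500
driver = [rng.gauss(0, 1) for _ in range(n)]           # true determinant (role of GDP)
proxy = [d + rng.gauss(0, 1) for d in driver]          # irrelevant, but correlated with driver
outcome = [2.0 * d + rng.gauss(0, 1) for d in driver]  # depends on driver only

b_short, se_short, _ = ols(outcome, [proxy])         # driver omitted
b_long, se_long, _ = ols(outcome, [proxy, driver])   # driver included

t_short = b_short[1] / se_short[1]  # t-stat of proxy without driver
t_long = b_long[1] / se_long[1]     # t-stat of proxy with driver
print("t-stat of proxy, driver omitted :", round(t_short, 2))
print("t-stat of proxy, driver included:", round(t_long, 2))
```

In this setup the t-statistic of `proxy` is very large when `driver` is left out and its coefficient collapses toward zero once `driver` enters, with no trend or unit root anywhere in the data.
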
8. 8. LESSON: Omitting a Significant Regressor Leads to NONSENSE REGRESSIONS. EXAMPLE: (G) CONS(Pakistan) = -13.44 + 11.07 GDP(Honduras) + ε (R² = 0.99), with (1.15), (0.12), (4.34) printed under the three right-hand terms in turn. Consumption of Pakistan as a function of the GDP of Honduras! Standard diagnosis: this is because CONS and GDP are not stationary. WRONG: CONS and GDP do NOT have to be integrated variables. All we need is an IMPORTANT DETERMINANT MISSING from the regression.
9. 9. ORIGINAL QUESTION: Are models (1), (2), (3) correct? HOW do we know whether important variables are missing or not? Maybe IPR is significant because it proxies for FDI, or for Exports, or for ANY of the other 61 variables available in the WDI data set. Edward Leamer: Fragility of Inference. First we study Leamer's solution: Extreme Bounds Analysis.
10. 10. Edward Leamer: Specification Search. The truth about (regression) models: Models are NOT used to DISCOVER truth. We start out KNOWING what we want to show. We manipulate the data into proving this. Example: hundreds of papers proving free trade is beneficial (Rodrik: A Skeptic's Guide). A similar fact holds for models of economic theory.
11. 11. Leamer: SPECIFICATION SEARCHES. By varying the set of variables, we can get ANY RESULT we like. Look at W as a determinant of Y and choose X1, X2, …, Xn. By choosing the right set of variables, we can make the coefficient of W positive or negative, significant or insignificant. The PROCESS of REGRESSION is a specification search, in which the econometrician looks for the right collection of regressors to prove his favorite hypothesis.
12. 12. How to test if W is a significant determinant of Y? Choose FIXED RELEVANT VARIABLES X1, X2, …, Xn (guaranteed to be important from THEORETICAL considerations, known a priori), a Focus Variable W, and Potential Determinants V1, V2, …, Vm. Regress Y on X1, …, Xn, W, and some combination of the Vi. If W is significant regardless of which combination of Vi is put in, then W is significant. Look at the range of possible values of the estimated coefficient of W. This is called EXTREME BOUNDS ANALYSIS. Typical conclusion: NO VARIABLE IS SIGNIFICANT. ALL INFERENCE IS FRAGILE.
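
The procedure on this slide can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not Leamer's own code: the bounds are taken as the lowest "coefficient minus two standard errors" and the highest "coefficient plus two standard errors" across specifications (one common way to operationalise the idea, which the slide does not spell out), and the data, the `ols` helper and the function name `extreme_bounds` are hypothetical.

```python
import itertools
import math
import random


def invert(m):
    """Gauss-Jordan inverse (partial pivoting) of a small square matrix."""
    k = len(m)
    aug = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(m)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def ols(y, columns):
    """OLS of y on an intercept plus `columns`; returns (coefs, std_errors, rss)."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    s2 = rss / (n - k)
    se = [math.sqrt(s2 * inv[a][a]) for a in range(k)]
    return beta, se, rss


def extreme_bounds(y, fixed, focus, candidates, max_extra=3):
    """Bounds on the focus coefficient over all subsets of up to max_extra candidates."""
    lower, upper = math.inf, -math.inf
    j = 1 + len(fixed)  # position of the focus coefficient (after intercept and fixed)
    for size in range(max_extra + 1):
        for subset in itertools.combinations(candidates, size):
            beta, se, _ = ols(y, fixed + [focus] + list(subset))
            lower = min(lower, beta[j] - 2 * se[j])
            upper = max(upper, beta[j] + 2 * se[j])
    return lower, upper


rng = random.Random(0)  # fixed seed, hypothetical data
n = 300
x = [rng.gauss(0, 1) for _ in range(n)]                      # fixed regressor X1
w = [rng.gauss(0, 1) for _ in range(n)]                      # focus variable W
v = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]  # doubtful V1..V4
y = [1 + xi + wi + v0 + rng.gauss(0, 1) for xi, wi, v0 in zip(x, w, v[0])]

lower, upper = extreme_bounds(y, [x], w, v)
robust = lower > 0 or upper < 0  # the sign never flips across specifications
print("extreme bounds for W:", round(lower, 3), round(upper, 3), "robust:", robust)
```

Here W is a genuine determinant and is nearly uncorrelated with the doubtful variables, so its bounds stay on one side of zero. With correlated candidates the bounds typically widen until they straddle zero, which is the "all inference is fragile" outcome described on the slide.
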
13. 13. Sala-i-Martin: I ran two million regressions. A VARIANT of Leamer's EBA. Start with 62 variables in the WDI data set. Set three as essential determinants: GDP60, LE60, PSE60 (LE = Life Expectancy, PSE = Primary School Enrolment; Barro). That leaves X1, …, X59. Choose any one of them as W, and choose ANY THREE OTHERS to run: Growth = a + b1 GDP60 + b2 LE60 + b3 PSE60 + d W + c1 Xi + c2 Xj + c3 Xk. VARY (i, j, k) over all possible sets of three regressors: 58 × 57 × 56 = 185,136 for each W (counting ordered triples). If W is significant in 95% of these regressions, then count W as significant. I count 10 million regressions here. RESULT: 22 variables out of the 59 are significant. Conclusion: EBA is too extreme.
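
The two regression counts in circulation ("two million" in the title, "10 million" on the slide) differ only in whether the triple (Xi, Xj, Xk) is counted with or without regard to order. A quick check, using only the numbers given on the slide (the variable names are hypothetical):

```python
import math

candidates = 59          # variables left after fixing GDP60, LE60, PSE60
others = candidates - 1  # pool for Xi, Xj, Xk once W has been chosen

ordered = math.perm(others, 3)    # 58 * 57 * 56, order of (i, j, k) matters
unordered = math.comb(others, 3)  # distinct sets {i, j, k}

total_ordered = candidates * ordered
total_unordered = candidates * unordered

print(ordered, total_ordered)      # 185136 10923024
print(unordered, total_unordered)  # 30856 1820504
```

Counting distinct sets gives about 1.8 million regressions, close to the "two million" of the title. Counting ordered triples gives about 10.9 million, but each distinct regression is then counted six times.
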
14. 14. What is wrong with Sala-i-Martin? The analysis is self-contradictory. If 22 variables are significant, then ALL regressions with fewer than 22 regressors have SIGNIFICANT OMITTED VARIABLES. It follows that all of his two million regressions are nonsense regressions. CAN we get sensible results by running two million nonsense regressions? Answer: NO. This can be established by a simulation study, done later by Hoover and Perez. The Sala-i-Martin strategy has high Type I and Type II error probabilities: it can include irrelevant regressors and exclude significant ones. It tends to include TOO MANY variables as relevant when they are NOT.
15. 15. Pure Bayesian Approach: Fernandez, Ley, Steel (2001). Take the 41 regressors from the Sala-i-Martin data set on which complete data is available. Consider all possible 2^41 (about two trillion) models and assign priors to them: each regressor has prior probability 50% of being included in the model. Compute posterior probabilities. Regressors with HIGH posterior probabilities have high probabilities of being determinants. The strongest determinants are: Confucian%, GDP60, EquipInv, LE60. There are many other determinants. Good models have 0.1% probability.
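
In symbols, and as a sketch in notation that is not on the slide (the labels $M_k$ for models and $x_j$ for regressors are assumptions), independent 50% inclusion probabilities give every one of the $2^{41} = 2{,}199{,}023{,}255{,}552$ models the same prior weight, and a regressor's posterior inclusion probability is the total posterior weight of the models that contain it:

$$
P(M_k) = 2^{-41}, \qquad
P(M_k \mid y) = \frac{p(y \mid M_k)\,P(M_k)}{\sum_{l=1}^{2^{41}} p(y \mid M_l)\,P(M_l)}, \qquad
P(x_j \text{ included} \mid y) = \sum_{k:\; x_j \in M_k} P(M_k \mid y).
$$

This is why a regressor can have a high inclusion probability even though every individual model, including the best ones, has a tiny posterior probability.
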
16. 16. Model Averaging VERSUS Selection. Selection focuses on finding the TRUE model and the CORRECT regressors. BMA (Bayesian Model Averaging) aims to USE all models, assign them weights, and come up with a combined forecast. DEBATE AND CONTROVERSY: Can we average over wrong models and get the right result? RESOLUTION: There are DIFFERENT GOALS, and each procedure is well suited to its OWN goal. SELECTION involves putting all eggs in one basket: HIGHER RISK. FORECASTING involves avoiding selection and getting insurance against bad choices.
17. 17. Hoover-Perez Simulations. BMA fails to find the right regressors, BUT does well at forecasting. So when it comes to CHOOSING the right set of regressors, the right strategy comes from ENCOMPASSING, using the Hendry Methodology.
18. 18. Hendry Methodology. Conventional methodology leads to conflicting, contradictory theories and models. T1: IPR, T2: FDI, T3: Exports, and many others. ALL of these theories describe determinants of growth, and they are in conflict with each other. Papers exist which prove ELG (export-led growth), GLE (growth-led exports), BOTH, NEITHER. Everybody runs a new regression and puts down a new brick in a different place. There is NO CUMULATION OF KNOWLEDGE.
19. 19. SOLUTION: ENCOMPASSING. Given T1, T2, T3, etc., a new researcher is NOT ALLOWED to simply put down T(J). New research MUST BUILD ON EXISTING RESEARCH. FIRST of ALL do a LIT REVIEW: that is, COVER, and BE AWARE of, ALL PRIOR EXISTING LITERATURE ON YOUR TOPIC. NEXT, explain the gap: what are the DEFECTS in existing theories? NEXT, FILL the gap: explain how and why T(J) is SUPERIOR to all existing theories. At the END there should be ONLY ONE BEST THEORY. Encompassing shows that our new theory COVERS all previous theories and IMPROVES UPON them. The NEXT researcher has to BEAT T(J) to produce T(J+1).
20. 20. How to do this for Choice of Regressors. GUM: General Unrestricted Model. ADD ALL RELEVANT variables. In our example, form a model with Exports, IPR and FDI, and include ALL regressors used by ALL the researchers. The GUM NESTS T1, T2, T3 as special cases: GUM: GDP = a + b1 IPR + b2 FDI + b3 Exports + c1 X1 + … + ck Xk. T1 says that b2 = 0 and b3 = 0; T2 says that b1 = 0 and b3 = 0; T3 says that b1 = 0 and b2 = 0. We can test these hypotheses using an F-test for the joint significance of multiple regressors.
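
For reference, the usual form of that F-test is sketched below. The symbols $RSS_U$, $RSS_R$, $q$, $n$ and $p$ are assumptions introduced here: $RSS_U$ is the residual sum of squares of the GUM, $RSS_R$ that of the restricted model (for example T1), $q$ the number of restrictions ($q = 2$ for each of T1, T2, T3), $n$ the number of observations, and $p = k + 4$ the number of coefficients in the GUM as written above.

$$
F = \frac{(RSS_R - RSS_U)/q}{RSS_U/(n - p)} \;\sim\; F_{q,\;n-p} \quad \text{under } H_0 \text{ and the classical assumptions.}
$$

A large value of $F$ rejects the restrictions, meaning the corresponding theory leaves out variables that the GUM shows to matter.
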
21. 21. Conventional Methodology: Simple-To-General. Start with a simple model, C = a + b GNP + error. If there is a FLAW, THEN look for additional regressors; make the model more complicated if necessary. What are FLAWS? Failures of the standard assumptions. HETEROSKEDASTICITY: can usually be fixed by taking LOGS. AUTOCORRELATION: can be fixed by adding DYNAMICS to the static equation.
22. 22. GUM Strategy for Autocorrelation. Suppose C = a + b Y + e has autocorrelated errors. Then e(t) = C(t) - a - b Y(t), and ALSO e(t-1) = C(t-1) - a - b Y(t-1). The AUTOCORRELATED MODEL IS e(t) = u(t) + r e(t-1). Substituting: C(t) = a + b Y(t) + u(t) + r e(t-1) = a + b Y(t) + r C(t-1) - r a - r b Y(t-1) + u(t) = (a - r a) + b Y(t) + r C(t-1) - r b Y(t-1) + u(t). Consider the GENERAL ARDL model, which is the General UNRESTRICTED Model: C(t) = a* + b Y(t) + c C(t-1) + d Y(t-1) + error. The AR-1 model is the special case with c = r, d = - b c, a* = a (1 - r).
23. 23. Flaws of the Simple-to-General Strategy. If the regression equation Y = a + b X does not forecast well, we add a relevant variable W. Then W may appear significant because it is a proxy for some other missing variable. This will DECEIVE the econometrician. If we impose the AR-1 restriction, we SET d = - b c. GeTS says: ALLOW UNRESTRICTED d, and THEN TEST the RESTRICTION.
24. 24. GeTS: General-To-Simple Modeling. Build the largest possible model. INCLUDE ALL POTENTIALLY RELEVANT REGRESSORS. Now no regressor can be significant because of OMITTED VARIABLES, because you have included them ALL (assuming we have data on ALL relevant variables). In the Sala-i-Martin data, run the regression on ALL 61 variables. THEN DROP the insignificant variables.
25. 25. Multiple Objections to GeTS. With lots of regressors, we have MULTICOLLINEARITY problems. Many important regressors will fail to be significant. NOISE can exceed SIGNAL. Bad regressors can drive out good ones. MANY PROBLEMS HAVE BEEN RESOLVED. MUCH PROGRESS HAS BEEN MADE. EXISTING ALGORITHMS GIVE fairly GOOD probabilities of FINDING A MODEL WHICH ENCOMPASSES the true model: around an 80% chance of picking up all relevant regressors, plus one or two extras (depending on the configuration of model and regressors).
26. 26. GeTS is NOT a mechanical procedure. It MUST be guided by KNOWLEDGE. IDEAL CASE for GeTS: ALL regressors are ORTHOGONAL, that is, INDEPENDENT. Then each regressor can be treated separately; they do not INTERFERE with each other. Arrange all the t-stats in decreasing ORDER and DROP all regressors whose t-stats are LESS than some critical value. There is no model selection problem! That is, significance will NOT be affected by MODEL selection in this situation.
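
A short way to see why the orthogonal case is ideal, in notation assumed here rather than taken from the slide ($x_j$ is the $j$-th regressor column, measured as a deviation from its mean, and $s$ is the estimated residual standard deviation): if $x_j' x_l = 0$ for all $j \neq l$, then $X'X$ is diagonal and

$$
\hat b_j = \frac{x_j' y}{x_j' x_j}, \qquad
t_j = \frac{\hat b_j}{\, s / \sqrt{x_j' x_j}\,}.
$$

The estimate $\hat b_j$ does not depend on which other regressors are in the model. Only $s$ changes as regressors are dropped, so the ranking of the t-statistics is unaffected and the statistics themselves move only slightly when the dropped regressors are irrelevant.
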
27. 27. Much more difficult with CORRELATED regressors. Y = a + b X + c W1 + d W2. IDEAL situation: the Good Regressor, when PRESENT, makes the Bad Regressors INSIGNIFICANT. In the first regression, of Pak Cons on female survival and CO2 emissions, putting in Pak GDP makes the other two variables INSIGNIFICANT. THEORETICALLY, this will ALWAYS happen ASYMPTOTICALLY: as we get larger and larger amounts of DATA (and the MODEL does not CHANGE), Good Variables WILL DRIVE out Bad Variables. PRACTICALLY, THIS IS NOT GUARANTEED. We are often working with small data sets. EVEN with BIG DATA, if the model changes from time to time, then all data sets are small.
28. 28. COMPLICATIONS. Pak Cons = a + b Pak GNP + c Honduras GNP + error. Honduras GNP REMAINS highly significant in this regression. What does this mean? Does Honduras GNP matter for Pakistani consumption? No: it is acting as a proxy for some OTHER missing variable. This KNOWLEDGE comes from our knowledge of the real world. THAT is why model selection cannot be mechanical.
29. 29. Software Package PC-GeTS: Automatic Model Selection. Automatic GeTS is implemented in the PC-GeTS package, which is available and USEFUL. It reaches correct models with high probability when sufficient data is available to discriminate. Regressors with HIGH VARIATION are quickly spotted; those with low variation MAY BE MISSED. The greater the correlation among regressors, the greater the possibility of ERROR: the wrong variable can be chosen instead of the right one.
30. 30. Human-Guided Model Selection. MODELLING CORRECTION: Start by ensuring a GOOD GUM, that is, one that fulfills the assumptions of the regression model. Choose the right functional forms (log or other). LINEARIZE relationships, and run a lot of different types of FIXES to put the initial model into GOOD shape BEFORE starting selection. TRY TO ORTHOGONALIZE REGRESSORS: C = a GNP(t) + b GNP(t-1) can be changed to C = a' GNP(t) + b' ∆GNP(t), with correspondingly redefined coefficients a' and b'.
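
The reparameterisation in the last sentence is exact, not an approximation. Writing $\Delta GNP_t = GNP_t - GNP_{t-1}$ (subscript notation assumed here):

$$
a\,GNP_t + b\,GNP_{t-1} \;=\; (a + b)\,GNP_t \;-\; b\,\Delta GNP_t .
$$

The fit is identical, but the level $GNP_t$ and the change $\Delta GNP_t$ are usually far less correlated with each other than $GNP_t$ and $GNP_{t-1}$ are, which makes the individual t-statistics more informative.
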
31. 31. Multiple Searches. Y = a0 + a1 X1 + … + a60 X60. We CAN test EVEN when regressors exceed observations! MAIN IDEA: drop ALL INSIGNIFICANT REGRESSORS. This works if the regressors are independent, but not if they are correlated. In that case, a Good regressor may be insignificant, and a Bad regressor may be significant. What to DO? CHOOSE USING THEORY OVER EMPIRICS: retain theoretically important variables. Take the TEN least significant variables and DROP them ONE at a TIME. This creates 10 DIFFERENT searches. Each variable is retained in at least one of the ten searches.
32. 32. Compare TERMINAL models using BIC. Continue EACH of the ten searches by eliminating the LEAST significant regressors. This MAY BE GUIDED BY THEORY AT EACH STAGE. Choose among the collection of FINAL models. This selection need not be mechanical. One can also use PRINCIPAL COMPONENTS to extract a small number of highly variable regressors from a large set, but problems arise in INTERPRETATION.
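
The multi-path search of slides 31 and 32 can be sketched as follows. This is a deliberately mechanical toy version under stated assumptions, not the PC-GeTS algorithm: the function names, the critical value of 2, the use of three starting paths instead of ten, the Gaussian form of BIC and the simulated data are all hypothetical, and the theory-guided judgement the slides insist on is left out.

```python
import math
import random


def invert(m):
    """Gauss-Jordan inverse (partial pivoting) of a small square matrix."""
    k = len(m)
    aug = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(m)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def ols(y, columns):
    """OLS of y on an intercept plus `columns`; returns (coefs, std_errors, rss)."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    s2 = rss / (n - k)
    se = [math.sqrt(s2 * inv[a][a]) for a in range(k)]
    return beta, se, rss


def abs_t(y, X, keep):
    """Absolute t-statistics of the regressors indexed by `keep`."""
    beta, se, _ = ols(y, [X[j] for j in keep])
    return {j: abs(beta[i + 1] / se[i + 1]) for i, j in enumerate(keep)}


def bic(y, X, keep):
    """Gaussian BIC (up to a constant): n*ln(RSS/n) + (#coefficients)*ln(n)."""
    n = len(y)
    _, _, rss = ols(y, [X[j] for j in keep])
    return n * math.log(rss / n) + (len(keep) + 1) * math.log(n)


def reduce_path(y, X, keep, t_crit=2.0):
    """Drop the least significant regressor until every remaining |t| >= t_crit."""
    keep = list(keep)
    while keep:
        t = abs_t(y, X, keep)
        worst = min(keep, key=lambda j: t[j])
        if t[worst] >= t_crit:
            break
        keep.remove(worst)
    return tuple(keep)


def multipath_search(y, X, n_paths=3, t_crit=2.0):
    """Start one path per least-significant regressor; pick the best terminal by BIC."""
    full = list(range(len(X)))
    t = abs_t(y, X, full)
    starts = sorted(full, key=lambda j: t[j])[:n_paths]
    terminals = {reduce_path(y, X, [j for j in full if j != s], t_crit) for s in starts}
    return min(terminals, key=lambda keep: bic(y, X, keep))


rng = random.Random(1)  # fixed seed, hypothetical data
n = 400
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(6)]
y = [1 + 2.0 * X[0][i] + 1.5 * X[1][i] + rng.gauss(0, 1) for i in range(n)]

best = multipath_search(y, X)
print("selected regressors:", best)
```

With independent regressors and a strong signal, the two true regressors (indices 0 and 1) survive every path. The cases that need human judgement are the correlated ones, where different paths end in different terminal models and BIC alone may not separate them convincingly.
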
33. 33. FINAL REMARKS. TWO STEPS: Building a Good Regression Model (not being deceived by ACCIDENTAL CORRELATIONS and SPURIOUS and NONSENSE REGRESSIONS). Picking out GENUINE, STABLE CORRELATIONS DOES NOT JUSTIFY CAUSAL INFERENCE. The GUM IDENTIFIES % CONFUCIAN as a KEY DETERMINANT of growth. WHY? Because of CHINA. NOT A CAUSAL RELATIONSHIP.
34. 34. To view the 70-minute video talk based on these slides, a brief summary, and a link to the full paper, see: http://bit.do/azreg