Conventional econometric methodology is seriously deficient because it gives little importance to the choice of regressors. Omitting an important regressor leads to nonsense regressions. These slides explain the problem and provide a remedy.
Choosing the Right Regressors
PIDE Nurturing Minds Seminar 13th Sept 2017
Asad Zaman, VC PIDE
Major Misunderstandings of Econometrics
Three Papers Submitted to PDR:
(1): GDP = a + b1 X1 + … + bn Xn + c IPR (IPR = Intellectual Property Rights)
(2): GDP = a + b1 X1 + … + bn Xn + c EXPORTS (Export Led Growth)
(3): GDP = a + b1 X1 + … + bn Xn + c1 FDI + c2 Lit (FDI + Human Capital)
Question: Can all three papers be right?
Axiom of Correct Specification
All INCLUDED Regressors MUST BE determinants.
All EXCLUDED Regressors MUST NOT BE DETERMINANTS
Both inclusions and exclusions MUST be correctly specified for the model to be valid.
ALL THREE REGRESSIONS CANNOT BE VALID:
If IPR is a determinant, then the other two are wrong, and similarly for the other two.
SO at MOST one of the three can be correct.
BUT is one of the three correct? There are more than 60 variables AVAILABLE in WDI
data sets. So if there is ONLY one determinant, we have more than 60 possible regressions.
HOW CAN WE FIND OUT WHICH ONE IS CORRECT?
Use Model Selection Criteria -- R-squared, AIC, BIC (Schwarz), etc.
This method does not work very well.
KEY PROBLEM: HOW TO ENSURE THAT THERE ARE NO MISSING SIGNIFICANT
VARIABLES – these cause EXTREME BIAS and WRONG AND MISLEADING RESULTS
INCLUDING EXTRA VARIABLES CAUSES LOSS OF EFFICIENCY, BUT DOES NOT CAUSE
INFERENCE FAILURE !! BECAUSE 0 coefficient is a possible value.
BIG MODEL MUST ENCOMPASS TRUE MODEL, if it is to produce VALID INFERENCE.
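The asymmetry between omitting a determinant and including an extra variable can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the data behind these slides: the variables x, z, w and every coefficient value are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: x and z are both true determinants and are correlated;
# w is an irrelevant extra variable.
x = rng.normal(size=n)
z = 0.8 * x + 0.6 * rng.normal(size=n)
w = rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

def fit(*cols):
    """OLS of y on a constant plus the given columns; returns coefficients."""
    X = np.column_stack([np.ones(n), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_short = fit(x)[1]        # z omitted: coefficient on x absorbs part of z's effect
b_true = fit(x, z)[1]      # correctly specified model
b_big = fit(x, z, w)[1]    # encompassing model with one irrelevant regressor

print(f"omit z: {b_short:.3f}  true model: {b_true:.3f}  extra w: {b_big:.3f}")
```

Under these assumptions the short regression should put the coefficient on x near 2 + 3(0.8) = 4.4, while both larger models should give a value near the true 2.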
How Much Trouble Can a Missing Variable Cause?
CONS = Final consumption expenditure, etc. (constant 2010 US$) Pakistan
GDP = GDP (constant 2010 US$) Pakistan
CONS(Pak) = 4.12 + 0.883 * GDP(Pak) + error (R2=0.998)
(0.51) (0.006) (2.56)
Standard Keynesian Consumption Function. Likely to have autocorrelation and other
MIS-SPECIFIED SHORT TERM DYNAMICS. Other small missing variables.
Super-Consistency Holds – Because of strong trends in CONS and GDP,
misspecifications will not matter. Bias due to omitted stationary variables will VANISH.
But omitted MAJOR variable can make a big difference.
What happens if we regress CONS(Pak)
on randomly chosen WDI Vars ???
SUR = Survival to age 65, female (% of cohort)
CO2 = CO2 emissions from gaseous fuel consumption (% of total)
Obviously, these variables have no relation to Consumption. Nonetheless, the OLS
regression yields the following results
(E) CONS = -268.7 + 6.78 SUR – 1.82 CO2 + error (R2=0.84)
(25.9) (0.73) (0.65) (20.0)
Both SUR and CO2 are HIGHLY SIGNIFICANT DETERMINANTS OF CONSUMPTION!
This is called a NONSENSE REGRESSION.
IRRELEVANT VARIABLES BECOME SIGNIFICANT AS PROXIES FOR MISSING VARS.
ADD Relevant Regressor GDP:
(F) CONS = 15.64 + 0.902 GDP – 2.60 SUR + 67.8 CO2 + error (R2=0.99)
(0.54) (0.014) (1.40) (80.3) (2.28)
Both SUR and CO2 become insignificant
Nonsense Regressions are caused by MISSING DETERMINANTS
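The proxy mechanism can be sketched on simulated data. The example below is only an illustration under stated assumptions: g stands in for the true determinant, s for an irrelevant variable that happens to be correlated with it, and all numbers are invented. The data are deliberately i.i.d. (stationary), so any spurious significance here comes from the omitted determinant alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

g = rng.normal(size=n)                         # true determinant (hypothetical)
s = 0.7 * g + rng.normal(size=n)               # irrelevant, but correlated with g
y = 1.0 + 0.9 * g + 0.5 * rng.normal(size=n)   # y depends on g only

def t_stats(y, *cols):
    """t-statistics from OLS of y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se

t_short = t_stats(y, s)[1]      # g omitted: s acts as a proxy for g
t_full = t_stats(y, g, s)[2]    # g included: s has nothing left to explain

print(f"t-stat of s without g: {t_short:.2f}, with g: {t_full:.2f}")
```

With g left out, s should look strongly significant; once g enters the regression, the t-statistic on s should typically fall back to the range expected of an irrelevant variable.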
WRONG DIAGNOSIS MADE BY ECONOMETRICIANS:
Nonsense Regressions are caused by NON STATIONARITY – Immense amount of
literature on Integration and Co-Integration: COMPLETELY USELESS
Discovery due to Atiqur Rahman – He is supervising a thesis on this theme.
LESSON: Omitting Significant Regressor
Leads to NONSENSE REGRESSIONS
(G) CONS(Pakistan) = -13.44 + 11.07 GDP(Honduras) + error (R2=0.99)
(1.15) (0.12) (4.34)
Consumption of Pakistan as a function of the GDP of Honduras !!
Standard Diagnosis – this is because CONS and GDP are not stationary. WRONG –
CONS and GDP do NOT have to be integrated variables. All we need is a MISSING
IMPORTANT DETERMINANT from the regression.
Are models (1), (2), (3) Correct?
HOW do we know if important variables are missing or not?
Maybe IPR is significant because it proxies for FDI or for Exports or for ANY of the
other 61 Variables available in WDI Data Set?
Edward Leamer: Fragility of Inference
First we study Leamer’s Solution – Extreme Bounds Analysis
Edward Leamer: Specification Search
The Truth about (regression) models
Models are NOT used to DISCOVER truth.
We start out KNOWING what we want to show.
We manipulate data into proving this
Example: Hundreds of papers proving free trade is beneficial
Rodrik: A Skeptic’s Guide.
Similar fact holds of models of economic theory.
Leamer: SPECIFICATION SEARCHES
By varying sets of variables, we can get ANY RESULT we like
Look at W as determinant of Y. Choose X1, X2, …, Xn
By choosing the right set of variables, we can make coefficient of W positive or
negative, significant or insignificant.
The PROCESS of REGRESSION is a specification search, where the Econometrician
looks for the right collection of regressors to prove his favorite hypothesis.
How to test if W is significant determinant
Choose FIXED RELEVANT VARIABLES (guaranteed to be important from THEORETICAL
consideration, known a priori ) X1, X2, …, Xn
Focus Variable W
Potential Determinants: V1, V2, … Vn
Regress Y on X1,..,Xn, W, and some combination of Vi
If W is significant regardless of what combination of Vi is put in, then W is significant.
Look at the range of possible values of estimated coefficient of W. This is called
EXTREME BOUNDS ANALYSIS.
Typical conclusion: NO VARIABLE IS SIGNIFICANT. ALL INFERENCE IS FRAGILE.
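The procedure just described can be sketched as a loop over all subsets of the potential determinants. This is a minimal illustration on simulated data, not Leamer's own code: the data-generating process, the use of coefficient plus or minus two standard errors as the bounds, and all variable names are assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 400

x = rng.normal(size=n)                  # fixed, theory-backed regressor
v = rng.normal(size=(n, 4))             # potential determinants V1..V4
w = v[:, 0] + rng.normal(size=n)        # focus variable, correlated with V1
y = 1.0 + x + v[:, 0] - 0.3 * w + rng.normal(size=n)   # hypothetical truth

def coef_and_se_of_w(v_cols):
    """OLS of y on constant, x, w and the chosen V columns; returns (b_w, se_w)."""
    X = np.column_stack([np.ones(n), x, w] + [v[:, j] for j in v_cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
    return beta[2], se

# Run one regression for every subset of the V's (2^4 = 16 regressions).
results = []
for r in range(5):
    for cols in itertools.combinations(range(4), r):
        b, se = coef_and_se_of_w(cols)
        results.append((cols, b, se))

lower = min(b - 2 * se for _, b, se in results)
upper = max(b + 2 * se for _, b, se in results)
fragile = lower < 0 < upper             # bounds straddle zero

print(f"extreme bounds for W: [{lower:.3f}, {upper:.3f}], fragile: {fragile}")
```

In this setup the sign of the coefficient on W should flip depending on whether V1 is included, which is exactly the situation in which the extreme bounds straddle zero and the inference is labelled fragile.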
Sala-i-Martin: I ran two million regressions
VARIANT of Leamer’s EBA
Start with 62 Variables in WDI data set. Set Three as Essential Determinants
GDP60, LE60, PSE60 – Life Expectancy and Primary School Enrolment (Barro).
That leaves X1,…,X59. Choose Any One of them as W, Choose ANY THREE OTHERS to include:
Growth = a + b1 GDP60 + b2 LE60 + b3 PSE60 + c W + c1 Xi + c2 Xj + c3 Xk
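VARY (i,j,k) over all possible sets of three regressors. 58x57x56 = 185,136 ordered triples, or 30,856 distinct sets.
If W is significant in 95% of these regressions, then count W as significant.
I count 10 million regressions here (59 x 185,136). Counting each set of three only once gives 59 x 30,856 = 1,820,504, about two million.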
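The regression count depends on whether the three extra regressors are treated as an ordered triple or as an unordered set. The short calculation below shows both, using only the figures given on this slide (59 candidate focus variables, 58 others to choose from).

```python
from math import comb, perm

focus_vars = 59            # each of X1..X59 takes a turn as W
others = focus_vars - 1    # 58 remaining candidates for (Xi, Xj, Xk)

ordered_triples = perm(others, 3)   # 58 * 57 * 56, order counted
distinct_sets = comb(others, 3)     # same regression regardless of order

total_ordered = focus_vars * ordered_triples
total_distinct = focus_vars * distinct_sets

print(ordered_triples, distinct_sets)   # 185136 30856
print(total_ordered, total_distinct)    # 10923024 1820504
```

Since a regression is the same whatever the order of its regressors, the count of distinct regressions is the second figure, which is close to the "two million" in Sala-i-Martin's title.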
RESULT: 22 Variables out of the 59 are significant. Conclusion: EBA is too extreme.
What is wrong with Sala-i-Martin?
Analysis is self-contradictory
If 22 variables are significant then ALL regressions with fewer than 22 regressors have
SIGNIFICANT OMITTED VARIABLES.
It follows that all of his two million regressions are nonsense regressions.
CAN we get sensible results by running two million nonsense regressions?
Answer NO. This can be established by simulation study, done by Hoover and Perez.
Sala-i-Martin strategy has high Type I and II error probabilities. It can include
irrelevant regressors and exclude significant ones. Tends to include TOO MANY
variables as being relevant when they are NOT.
Pure Bayesian Approach
Fernandez, Ley, Steel (2001)
Take 41 regressors from the Sala-i-Martin data set on which complete data is available. Consider all
2^41 (about two trillion) possible models. Assign priors to them.
Each regressor has prior probability 50% of being included in the model.
Compute posterior probabilities.
Regressors with HIGH posterior probabilities have high probabilities of being
Strongest determinants are: Confucian% -- GDP60, EquipInv, LE60
Many other determinants.
Good models have 0.1 % probability.
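The idea of averaging over all models and reading off posterior inclusion probabilities can be sketched on a toy problem. This is not the Fernandez, Ley and Steel procedure: the sketch swaps in the common BIC approximation to posterior model probabilities, uses a uniform prior over models (equivalent to a 50% prior inclusion probability for each regressor), and runs on simulated data with only 5 candidate regressors so that all 2^5 models can be enumerated.

```python
import itertools
import numpy as np

print(2 ** 41)   # 2199023255552 models with 41 candidate regressors

rng = np.random.default_rng(3)
n, k = 200, 5
X = rng.normal(size=(n, k))
# Hypothetical truth: only regressors 0 and 1 matter.
y = 1.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

def bic(cols):
    """BIC of the OLS model with a constant and the chosen regressors."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    return n * np.log(rss / n) + Z.shape[1] * np.log(n)

# Enumerate every subset of the k regressors (2^k models).
models = [c for r in range(k + 1) for c in itertools.combinations(range(k), r)]
bics = np.array([bic(c) for c in models])

# Approximate posterior model probabilities under a uniform model prior.
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()

# Posterior inclusion probability: total weight of models containing regressor j.
pip = np.array([sum(wt for wt, c in zip(weights, models) if j in c)
                for j in range(k)])
print(np.round(pip, 3))
```

With 41 regressors full enumeration is not feasible, which is why the model space has to be sampled in practice; the toy version only shows what is being computed.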
Model Averaging VERSUS Selection
Selection focuses on finding TRUE model and CORRECT regressors.
BMA aims to USE all models, assign them weights, and come up with combined
DEBATE AND CONTROVERSY:
Can we average over wrong models and get right result?
RESOLUTION – There are DIFFERENT GOALS, and each procedure is well suited to
its OWN goal.
SELECTION involves putting all eggs in one basket. HIGHER RISK.
FORECASTING involves avoiding selection and getting insurance against bad
Hoover Perez Simulations
BMA fails to find the right regressors, BUT does well at forecasting.
So when it comes to CHOOSING the right set of regressors, the right strategy comes
from ENCOMPASSING, using the Hendry Methodology
Conventional Methodology leads to conflicting, contradictory theories and models
T1: IPR -- T2: FDI -- T3: Exports and many others – ALL theories describe
determinants of growth. They are in conflict with each other.
Papers exist which prove ELG, GLE, BOTH, NEITHER
Everybody runs a new regression, and puts down a new brick in a different place.
There is NO CUMULATION OF KNOWLEDGE.
Given T1, T2, T3, etc. New Researcher is NOT ALLOWED to put down T(J)
New Research MUST BUILD ON EXISTING RESEARCH.
FIRST of ALL do a LIT REVIEW – that is, COVER, and BE AWARE of ALL PRIOR
EXISTING LITERATURE ON YOUR TOPIC.
NEXT, explain the gap: What are the DEFECTS in existing theories?
NEXT, FILL the gap: Explain how and why T(J) is SUPERIOR to all existing theories.
At the END there should be ONLY ONE BEST THEORY – Encompassing shows that
our new theory COVERS all previous theories and IMPROVES UPON them. NEXT
researcher has to BEAT T(J) to produce T(J+1).
How to do this for Choice of Regressors
GUM: General Unrestricted Model
ADD ALL RELEVANT Variables
In our example, form model with Exports, IPR, FDI, and include ALL regressors used
by ALL the researchers. The GUM NESTS T1 T2 T3 as special cases:
GDP = a + b1 IPR + b2 FDI + b3 Exports + c1 X1 + … + ck Xk
T1 says that b2=0 and b3=0, T2 says that b1=0, b3=0, T3 says b1=b2=0
We can test these hypotheses using F-test for joint significance of multiple regressors.
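The F-test for each nested theory compares the residual sum of squares of the restricted model with that of the GUM. The sketch below uses simulated data in which, by assumption, only FDI matters among the three competing variables; the variable names follow the slide but the numbers are invented, and a single control x1 stands in for X1, …, Xk.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
ipr, fdi, exports, x1 = rng.normal(size=(4, n))
gdp = 1.0 + 0.8 * fdi + 0.5 * x1 + rng.normal(size=n)   # hypothetical truth

def rss(*cols):
    """Residual sum of squares from OLS of gdp on a constant plus cols."""
    X = np.column_stack([np.ones(n), *cols])
    beta, *_ = np.linalg.lstsq(X, gdp, rcond=None)
    resid = gdp - X @ beta
    return float(resid @ resid)

rss_gum = rss(ipr, fdi, exports, x1)
k_gum = 5    # constant + 4 regressors in the GUM

def f_stat(rss_restricted, q):
    """F statistic for q zero restrictions tested against the GUM."""
    return ((rss_restricted - rss_gum) / q) / (rss_gum / (n - k_gum))

f_t1 = f_stat(rss(ipr, x1), 2)       # T1: b2 = b3 = 0 (only IPR)
f_t2 = f_stat(rss(fdi, x1), 2)       # T2: b1 = b3 = 0 (only FDI)
f_t3 = f_stat(rss(exports, x1), 2)   # T3: b1 = b2 = 0 (only Exports)

print(f"F(T1)={f_t1:.2f}  F(T2)={f_t2:.2f}  F(T3)={f_t3:.2f}")
```

Each statistic is compared with the F critical value for 2 and n - 5 degrees of freedom, roughly 3.0 at the 5% level in large samples. Here T1 and T3 should be rejected decisively, while T2 should usually survive.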
Start with C = a + b GNP + error -- Start with Simple Model,
If there is a FLAW, THEN look for additional regressors – Make it more complicated if needed.
What are FLAWS? Failures of standard assumptions
Heteroskedasticity (can usually be fixed by taking LOGS)
AUTOCORRELATION: Can be fixed by adding DYNAMICS to static equation
GUM Strategy for Autocorrelation
Suppose C = a + b Y + e has autocorrelated errors.
Then: e(t) = C(t) – a – b Y(t). ALSO: e(t-1) = C(t-1) – a – b Y(t-1)
AUTOCORRELATED MODEL IS e(t) = u(t) + r e(t-1)
C(t) = a + b Y(t) + u(t) + r e(t-1)
= a + b Y(t) + r C(t-1) –ra – rb Y(t-1) + u(t)
= (a - ra) + b Y(t) + r C(t-1) – rb Y(t-1) + u(t)
Consider the GENERAL ARDL model – This is General UNRESTRICTED Model
C = a* + b Y + c C(-1) + d Y(-1) + e
AR-1 model is special case with c = r, d = - bc, a* = a (1-r)
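The restriction that the AR-1 model imposes on the general ARDL model can be checked numerically. The sketch below simulates a static equation with AR(1) errors, fits the unrestricted ARDL regression, and compares the estimates with the implied values. The parameter values a = 2, b = 0.8, r = 0.6 and the i.i.d. regressor Y are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
a, b, r = 2.0, 0.8, 0.6      # hypothetical true parameters

Y = rng.normal(size=n)
u = rng.normal(size=n)

# Build AR(1) errors: e(t) = r e(t-1) + u(t)
e = np.zeros(n)
for t in range(1, n):
    e[t] = r * e[t - 1] + u[t]

C = a + b * Y + e            # static equation with autocorrelated errors

# Unrestricted ARDL: C(t) = a* + b Y(t) + c C(t-1) + d Y(t-1) + error
X = np.column_stack([np.ones(n - 1), Y[1:], C[:-1], Y[:-1]])
coef, *_ = np.linalg.lstsq(X, C[1:], rcond=None)
a_star, b_hat, c_hat, d_hat = coef

print(f"c = {c_hat:.3f} (r = {r})")
print(f"d = {d_hat:.3f} vs -b*c = {-b_hat * c_hat:.3f}")
print(f"a* = {a_star:.3f} vs a(1-r) = {a * (1 - r):.3f}")
```

When the data really come from the AR-1 model, the unrestricted estimate of d should land close to -bc. A large gap between the two would be evidence against the AR-1 restriction, which is the point of estimating the general model first and testing the restriction afterwards.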
Flaws of Simple to General Strategy
If regression equation does not forecast well (Y=a+bX) add relevant variable W.
Then W may appear significant because it is proxy for some other missing variable.
This will DECEIVE the econometrician.
If we add AR-1 restriction, we SET d = - bc. GeTS says ALLOW UNRESTRICTED d,
and THEN TEST RESTRICTION.
GeTS: General-To-Simple Modeling
GeTS: Build the largest possible model. INCLUDE ALL POTENTIALLY RELEVANT
REGRESSORS. Now no regressor can be significant because of OMITTED
VARIABLES. Because you have included them ALL
Assuming we have data on ALL relevant variables
In the Sala-i-Martin data, run regression on ALL 61 variables.
THEN DROP insignificant Variables.
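The "start big, then drop" step can be sketched as a simple backward elimination. This is a bare-bones single-path illustration on simulated data, not the algorithm used by any particular package: the 1.96 cut-off, the eight candidate regressors and the data-generating process are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 300, 8
X = rng.normal(size=(n, k))
# Hypothetical truth: only regressors 0, 1 and 2 matter.
y = 1.0 + X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(size=n)

def t_stats(cols):
    """t-statistics (constant excluded) from OLS of y on the chosen regressors."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - Z.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return (beta / se)[1:]

kept = list(range(k))          # start from the GUM with every regressor
while kept:
    t = np.abs(t_stats(kept))
    weakest = int(np.argmin(t))
    if t[weakest] >= 1.96:     # everything left is significant: stop
        break
    kept.pop(weakest)          # drop the least significant regressor, refit

print("retained regressors:", kept)
```

The retained set should contain the three true determinants, possibly with an extra regressor that passed the cut-off by chance.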
Multiple Objections to GeTS
With lots of regressors, we have MULTICOLLINEARITY problems.
Many important regressors will fail to be significant.
NOISE can exceed SIGNAL. Bad Regressors can drive out Good ones.
MANY PROBLEMS HAVE BEEN RESOLVED.
MUCH PROGRESS HAS BEEN MADE
EXISTING ALGORITHMS GIVE fairly GOOD probabilities of FINDING A MODEL WHICH
ENCOMPASSES the true model.
That is, around an 80% chance of picking up all relevant regressors, plus one or two
extras. (depending on configurations of model and regressors)
GeTS is NOT a mechanical procedure.
MUST be guided by KNOWLEDGE
IDEAL CASE for GeTS
ALL regressors are ORTHOGONAL – that is INDEPENDENT.
Then each regressor can be treated separately, they do not INTERFERE with each other.
Arrange all the t-stats for significance in decreasing ORDER. DROP all t-stats LESS than
some critical value.
There is no model selection problem! That is significance will NOT be affected by
MODEL selection in this situation.
Much more difficult with CORRELATED Regressors
Y = a + b X + c W1 + d W2
IDEAL situation: Good Regressor, when PRESENT, makes Bad Regressor INSIGNIFICANT.
In first regression of Pak Cons on Female Survival and CO2 Emissions, if we put in Pak
GDP, it makes the other two variables INSIGNIFICANT.
THEORETICALLY, this will ALWAYS happen ASYMPTOTICALLY – as we get larger and
larger amounts of DATA (and MODEL does not CHANGE) Good Variables WILL DRIVE
out bad Variables.
PRACTICALLY THIS IS NOT GUARANTEED. Often working with small data sets. EVEN
with BIG DATA, if model changes from time to time then all data sets are small.
Pak Cons = a + b Pak GNP + c Honduras GNP + error
Honduras GNP REMAINS highly significant in this regression.
What does this mean? Does Honduras GNP matter for Pakistani Consumption?
No – it is acting as a proxy for some OTHER missing variable
This KNOWLEDGE comes from our knowledge of the real world. THAT is why model
selection cannot be mechanical.
Software Package PC-GeTS
Automatic Model Selection
Automatic GeTS is implemented in the PC-GeTS package, available and USEFUL.
It reaches correct models with high probability when sufficient data is available to
VARIABLE regressors are quickly spotted, those with low variation MAY BE MISSED.
The more the correlation, the greater possibility for ERROR – wrong variable can be
chosen instead of right one.
Human Guided Model Selection
MODELLING CORRECTION: Start by ensuring a GOOD GUM – that is, fulfill
assumptions of regression model. Choose right functional forms (log or other).
LINEARIZE relationships, and run a lot of different types of FIXES to put initial model
into GOOD Shape BEFORE starting selection.
TRY TO ORTHOGONALIZE REGRESSORS:
C = a GNP(t) + b GNP(t-1) can be changed to C = a' GNP(t) + b' ∆GNP(t), where a' = a + b and b' = -b
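The effect of this reparameterization can be checked on a simulated trending series. The sketch below is an illustration under assumptions (a random walk with drift standing in for GNP, invented coefficients): it compares the correlation between the two regressors before and after the change, and confirms that the two forms are the same model with coefficients related by a' = a + b and b' = -b.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
gnp = np.cumsum(1.0 + rng.normal(size=n))     # hypothetical trending series

g_t, g_lag = gnp[1:], gnp[:-1]
d_g = g_t - g_lag                              # first difference of GNP
cons = 0.6 * g_t + 0.3 * g_lag + rng.normal(size=n - 1)

corr_levels = np.corrcoef(g_t, g_lag)[0, 1]    # GNP(t) vs GNP(t-1)
corr_diff = np.corrcoef(g_t, d_g)[0, 1]        # GNP(t) vs change in GNP

# Fit both parameterizations by OLS (no constant, as on the slide).
a, b = np.linalg.lstsq(np.column_stack([g_t, g_lag]), cons, rcond=None)[0]
a2, b2 = np.linalg.lstsq(np.column_stack([g_t, d_g]), cons, rcond=None)[0]

print(f"corr(GNP(t), GNP(t-1)) = {corr_levels:.4f}")
print(f"corr(GNP(t), dGNP(t))  = {corr_diff:.4f}")
print(f"a + b = {a + b:.4f}, a' = {a2:.4f};  -b = {-b:.4f}, b' = {b2:.4f}")
```

The fit is identical in both forms; what changes is that the regressors are far less correlated, so their individual t-statistics become easier to interpret.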
Y = a0 + a1 X1 + … + a60 X60
CAN Test EVEN when regressors exceed observations !
MAIN IDEA: drop ALL INSIGNIFICANT REGRESSORS. Works if regressors are
independent. But not if they are correlated. In this case, Good regressor may be
insignificant, and Bad Regressor may be significant. What to DO?
CHOOSE USING THEORY OVER EMPIRICS: Retain Theoretically Important Variables.
Take the TEN least significant variables. DROP them ONE at a TIME. This creates 10
DIFFERENT searches. Every variable is retained in at least one of the ten searches.
Compare TERMINAL models using BIC
Continue EACH of ten searches by eliminating the LEAST significant regressors.
MAY BE GUIDED BY THEORY AT EACH STAGE
Choose among collection of FINAL models.
This selection need not be mechanical
Can also use PRINCIPAL COMPONENTS to extract a small number of highly
variable regressors from a large set. But Problems arise in INTERPRETATION.
TWO STEPS: Building a Good Regression Model (not being deceived by
ACCIDENTAL CORRELATIONS and SPURIOUS and NONSENSE REGRESSIONS)
Picking out GENUINE, STABLE CORRELATIONS DOES NOT JUSTIFY CAUSAL
GUM IDENTIFIES % CONFUCIAN as a KEY DETERMINANT of growth
Because of CHINA. NOT A CAUSAL RELATIONSHIP