SlideShare a Scribd company logo
1 of 137
Download to read offline
Estimation & Inference Under Non-Standard Conditions
by
Nicky Grant
Robinson College
Dissertation submitted to
the University of Cambridge
for the degree of
Doctor of Philosophy
Supervisor: Professor Richard J. Smith
Faculty of Economics
c August 2013
ii
Dedicated entirely to my Nanna Lilly,
Without whom little of this would have been possible.
ii
Declaration
I hereby declare that this dissertation is the result of my own work,
includes nothing which is the outcome of work done in collaboration
except where specifically stated in the text and is not substantially
the same as any other work that I have submitted or will be sub-
mitting for a degree or diploma or other qualification at this or any
other university, and does not exceed the prescribed word limit of
60,000 words.
Chapter 2 is a slightly condensed version of the paper published as Grant,
Nicky. ‘Overcoming the Many Weak Instrument Problem Using Normalized
Principal Components.’ Advances in Econometrics 29 (2012): 107-147.
Chapter 3 is a based on a paper co-authored with Richard J. Smith based
on an earlier working paper named ‘Estimation & Inference from Uncon-
ditional Moment Inequality Restrictions Models Estimated via GMM and
Generalized Empirical Likelihood’.
Nicky Grant
6th August 2013
iii
iv
Summary
This dissertation studies identification of some unknown parameter from a
set of moment conditions, covering both inequality and equality restrictions.
Chapter 1 considers identification robust inference from the inversion of the
Generalised Anderson-Rubin Statistic (GAR) based on a χ2
m approximation
where m is the number of moment conditions. This method is known to pro-
vide valid inference under a set of assumptions including the moment variance
be non-singular at the true parameter θ0, e.g Stock & Wright (2000). This
assumption is shown to be untenable for many forms of identification failure
in non-linear models, as noted for a class of regression models in Andrews &
Cheng (2012). They overcome the issue of singularity for asymptotic analy-
sis by a restrictive assumption that the moment variance be non-singular up
to a particular matrix of model parameters. To provide results for general
forms of identification failure a novel asymptotic approach is developed based
on higher order asymptotic expansions of the eigensystem of the moment
variance around θ0. Without reference to an assumption moment variance
singularity takes a known form the GAR statistic is shown to possess a χ2
m
limit under additional regularity conditions when moments are singular that
are currently known in the literature. One such condition requires the null
space of the moment variance lie within that of the outer product of the ex-
pected first order derivative at θ0. When this condition is violated the GAR
statistic is shown to be Op(n) and is termed the ‘moment-singularity bias’. A
simulation experiment demonstrates this bias for a IV Linear Simultaneous
Equations example. When this condition is almost violated the simulation
shows the GAR statistic may be very oversized even for large sample sizes.
v
Summary
Chapter 2 provides a method of ordering and selecting instruments so as to
minimise the many weak instrument bias in linear IV settings. A potential
flaw of the commonly used Principal Component (PC) method of instrument
reduction is demonstrated. In light of this a new method is derived termed
‘Normalised Principal Components’ (NPC). This method provides a set of in-
struments with a corresponding asymptotically valid ranking in terms of their
correlation with the endogenous variable. This instrument set and ordering
is then used to select instruments by minimising the MSE approximations
of Donald & Newey (2001). Favourable small sample properties of the IV
estimator based on this technique relative to PC methods are demonstrated
in a simulation. Finally the NPC method is applied to the Vietnam War
Draft IV setup of Angrist & Krueger (1992). Fourteen NPC’s are shown to
have a non-zero correlation with education (p < 0.1) and 2SLS(and related)
estimators based on such instruments estimate the returns to schooling to be
much lower than that of both OLS and 2SLS with all instruments.
Chapter 3 studies inference from unconditional moment inequalities, an area
of research which is growing in popularity in the econometrics literature.
Specifically the properties of a GEL-based estimator of the identified set from
a set of moment inequalities are derived. To do so the results presented in
Chernozhukov Hong & Tamer (2007) [CHT] based on a GMM type estimator
from a set of moment inequalities are extended by dropping the assumption
that the weight matrix is (asymptotically) diagonal. This assumption though
seemingly innocuous is critical to the results and proofs of this paper. The
GEL objective function is then shown on the identified set to be first order
asymptotically equivalent to that of GMM with weight matrix equal to the
inverse of the sample moment variance. Using this result consistency of the
GEL estimator for the identified set and rate of convergence in the Hausdorff
Distance along with the requisite regularity conditions are established.
vi
Acknowledgments
Would like to thank my supervisor Richard J. Smith for his help and guidance
over the past 4 years. Also my research advisor Hashem Pesaran who has
provided useful advice and tips for reading which has enriched the content
of this dissertation. I would also like to acknowledge the helpful discussions
in the graduate office over the years- especially with Manasa Patnam and
Steve Thiele. Also conference participants at the World Congress of the
Econometrics Society meetings in Shanghai (2010) and the 11th Advances
in Econometrics Conference ‘Essays in Honor of Jerry Hausman’ for useful
comments related to some of the material in this dissertation..
I would also like to thank all the participants at my Job Market Presenta-
tions where I received a lot of positive and helpful feedback. Namely seminar
participants at University of Manchester, University of Pompeu Fabra, Uni-
versity of St Gallen, Bilkent University, New Economic School, University
of New South Wales and the University of Bristol. Also comments from
my practise job market talk at the University of Cambridge and at the Euro-
pean Winter Meetings 2012 which greatly enriched the quality of my final job
market presentations and paper. In particular Melvyn Weeks, Alastair Hall,
Barbara Rossi, Oliver Linton, Majid Al-Sadoon, SeoJeong Lee and Daniel
Buncic.
Finally I would like to acknowledge the ERSC funding which financed my
PhD studies from 2009-2012.
vii
viii
Definitions
Statistical Definitions:
E[x]- Mathematical expectation of x with respect to the density of x.
E[x|y] - Mathematical expectation of x with respect to the distribution of x
conditional on y.
p
→ - Convergence in probability
d
→ - Convergence in distribution
d
→- Weak convergence → - For any deterministic sequence an then an → b
denotes b as the deterministic limit of an.
d
∼ - Shorthand for ‘ is distributed as’
w.p.a.1- With probability approaching 1
w.p.1 - With probability 1
op(a) - A variable that converges to zero w.p.a.1 when divided by a
Op(a) -A variable bounded in probability when divided by a
a.s(z) - Refers to ‘almost surely’ with respect to the distribution z
Matrix Definitions:
Let A,B refer to arbitrary matrices, C an arbitrary vector and a,b two arbi-
trary real numbers.
A = 0 - All entries of A equal to 0
Rank(A) - Rank of A
A−1
-Inverse of A
A−
- Moore-Penrose Generalised Inverse (AA−
A = A)
Null(A)- Null Space of A
ix
Definitions
tr(A) - Trace of A
||A|| - Euclidean Norm (tr(A A)1/2
)
B(C, )- A ball around C where > 0
⊗- Kronecker Product
A B = A − B (set subtraction)
diag(A)- Diagonal matrix formed from the diagonal entries of A
[A]ij- ij th element of A
Ia×a- a × a Identity Matrix
0a×a - An a × a matrix of zeroes
0a - a × 1 vector of zeroes
a ∧ b = max{a, b}
a− = min{a, 0}
a+ = max{a, 0}
a − = a−
Abbreviations
p.s.d - Positive Semi-Definite
p.d- Positive Definite
f.c.r- Full column rank
w.r.t- With respect to
s.t- Such that
iff- If and only if
CMT- Continuous Mapping Theorem
T- Triangle Inequality
CS- Cauchy-Schwartz Inequality
M-Markov Inequality
UWL - will denote a uniform weak law of large numbers such as Lemma 2.4
of Newey and McFadden (1994)
CLT - Lindeberg-Levy Central Limit Theorem.
KWLLN- Khinctine Weak Law of Large Numbers
MSE- Mean Squared Error
x
IV-Instrumental Variables
GMM-Generalised Method of Moments
(G)EL- (Generalised) Empirical Likelihood
2SLS-Two Stage Least Squares
OLS- Ordinary Least Squares
PC- Principal Components
xi
xii
Contents
Declaration iii
Summary v
Acknowledgments vii
Definitions ix
1 Identification Robust Inference with Singular Variance 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Identification and Singular Variance . . . . . . . . . . . . . . . 5
1.2.1 Conditional Moments . . . . . . . . . . . . . . . . . . . 6
Case (i): E[∂ρ(θ)/∂θ |z] = ∂ρ(θ)/∂θ . . . . . . . . . . 6
Case (ii) E[∂ρ(θ)/∂θ |z] = ∂ρ(θ)/∂θ . . . . . . . . . . 7
1.2.2 Examples of Singular Variance . . . . . . . . . . . . . . 9
Singular Variance : Null(Ω) ⊆ Null(GG ) . . . . . . . . 9
Singular Variance: Null(Ω) Null(GG ) . . . . . . . . 11
1.3 Matrix Perturbation Theory . . . . . . . . . . . . . . . . . . 12
1.3.1 Asymptotic Eigensystem Expansions . . . . . . . . . . 14
1.4 Generalized Anderson Rubin Statistic with Singular Variance 16
1.4.1 Simulation : Heckman Selection . . . . . . . . . . . . . 19
1.4.2 Moment-Singularity Bias when Null(Ω) Null(GG ) . 21
Simulation : Linear IV Simultaneous Equations . . . . 21
1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
xiii
CONTENTS
2 Overcoming The Many Weak Instrument Problem Using Nor-
malized Principal Components 37
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.2 Instrument Selection Methods . . . . . . . . . . . . . . . . . . 41
2.2.1 MSE Approximations of Donald & Newey (2001) . . . 42
2.2.2 The Regularization MSE Approach . . . . . . . . . . . 43
2.3 Principal Components Ranking of Instruments . . . . . . . . 45
2.3.1 Problem With The PC Method of Instrument Reduction 46
2.4 Normalized Principal Components . . . . . . . . . . . . . . . 48
2.4.1 Sample NPC Method . . . . . . . . . . . . . . . . . . 49
2.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.5.1 Simulation Results . . . . . . . . . . . . . . . . . . . . 54
2.6 Application to Angrist & Krueger (1992) . . . . . . . . . . . . 56
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.9 Appendix A: Implementing NPC Method . . . . . . . . . . . . 69
2.9.1 R Code for NPC Instrument Selection . . . . . . . . . 71
3 GEL-Based Inference with Unconditional Moment Inequal-
ity Restrictions 75
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2 Moment Inequality Restrictions . . . . . . . . . . . . . . . . . 77
3.3 GMM and GEL . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3.1 GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3.2 GEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.3.3 Identified Set . . . . . . . . . . . . . . . . . . . . . . . 82
3.4 Set Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.1 GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.2 GEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Appendix A: Assumptions . . . . . . . . . . . . . . . . . . . . 86
Appendix B: Preliminary Lemmas . . . . . . . . . . . . . . . . 87
xiv
Appendix C: Proofs for GMM . . . . . . . . . . . . . . . . . . 90
Appendix D: Proofs for GEL . . . . . . . . . . . . . . . . . . . 93
D.1 GEL Estimator Equivalence . . . . . . . . . . . . . . . 93
D.2 Asymptotics for GEL . . . . . . . . . . . . . . . . . . . 95
Appendix E: Identified Set . . . . . . . . . . . . . . . . . . . . 100
Bibliography 105
xv
xvi
List of Figures
2.1 NPC Decomposition of Proportion of Variation in Education
(130 Instruments) . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.2 MSE Approximation Ordered by NPCs(130 Instruments) . . . 61
2.3 NPC Decomposition of Proportion of Variation in Education
(360 Instruments) . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.4 MSE Approximation Ordered by NPCs(360 Instruments) . . . 65
xvii
xviii
List of Tables
1.1 GAR Rejection Probabilities: Heckman Selection . . . . . . . 20
1.2 rGAR Rejection Probabilities: Heckman Selection . . . . . . . 20
1.3 GAR Rejection Probabilities ¯π = 0 (Moment-Singularity Bias) 23
1.4 GAR Rejection Probabilities ¯π = 0.1 (Moment-Singularity Bias) 24
1.5 GAR Rejection Probabilities ¯π = 0.5 (Moment-Singularity Bias) 25
2.1 First Stage PC Coefficients (π2
pcj) . . . . . . . . . . . . . . . . 53
2.2 Eigenvalues of Principal Components(λj) . . . . . . . . . . . . 53
2.3 Variation in xi explained by Principal Components (π2
pcjλj) . . 54
2.4 Simulation Results: NPC vs. PC Instrument Selection . . . . 56
2.5 NPC First Stage Regression Coefficients (130 instruments) . . 59
2.6 NPC First Stage Regression Coefficients (360 instruments) . . 62
2.7 Estimates of Returns to Education . . . . . . . . . . . . . . . 63
xix
xx
Chapter 1
Identification Robust Inference
with Singular Variance
This chapter studies identification robust inference when moments have sin-
gular variance at the true parameter θ0. Existing robust methods assume
non-singular moment variance at θ0 up to a particular known matrix of pa-
rameters, Andrews and Cheng (2012). This is shown to restrict the class
of identification failure for which current results on robust methods hold.
General conditions under which the GAR statistic has a χ2
m limit distribu-
tion are derived utilizing second order asymptotic eigensystem expansions of
the sample variance matrix around θ0. This method prevents the necessity
of restrictive assumptions on the rank and form of the population variance
along sequences converging the true parameter. A crucial condition for this
result requires that the null space of the moment variance lies within that
of the outer product of the expected first order derivative at θ0. When
this condition is violated the GAR statistic is Op(n), which is termed the
‘moment-singularity bias’. Empirically relevant examples of this problem are
provided and the bias verified in a simulation.
Keywords: Generalized Anderson Rubin Statistic, Identification Failure,
Singular Variance, Non-linear models, Matrix Perturbation Theory.
1
Chapter 1
1.1 Introduction
Identification robust methods of inference have gained increasing prominence
in the econometrics literature in the last decade. Broadly its objective has
been to provide asymptotically valid methods of inference on some unknown
parameter θ0 robust to failures of either global or first-order identification. A
substantive part of this literature derives confidence sets containing θ0 with
asymptotically correct probability inverting a pre-specified test statistic over
a parameter space.
A large part of this literature has focussed on Linear Instrumental Variable
(IV) settings with its roots in the work of Anderson and Rubin (1949). A now
sizeable literature has developed providing alternative procedures aiming to
make as few possible assumptions to justify asymptotically valid inference
on θ0, including but not limited to Kleibergen (2002,2005), Moreira (2003),
Chernozhukov & Hansen (2008), Kleibergen & Mavroeidis (2009), Magnus-
son (2010), Guggenberger et al (2012).
General non-linear moment functions have received relatively little attention
in this literature, a notable exception1
being the GAR statistic of Newey
and Windmeijer (2009). Also known as the Continuous Updating Estimator
(CUE) statistic, Guggenberger, Ramalho and Smith (2005) and confidence
regions based on the GAR statistic defined as ‘S-Sets’ in Stock and Wright
(2000) .
Let wi (i = 1, .., n) be an independent and identically distributed (i.i.d)
data set with a known m × 1 moment function g(w, θ) satisfying the mo-
ment condition E[g(wi, θ)] = 0 at the true parameter θ0 ∈ Θ ⊆ Rp
. Define
the sample moment function and corresponding variance matrix respectively
ˆg(θ) := 1
n
n
i=1 gi(wi, θ), ˆΩ(θ) := 1
n
n
i=1 gi(wi, θ)gi(wi, θ) . The GAR statis-
tic is defined
ˆTGAR(θ) := nˆg(θ) ˆΩ(θ)−1
ˆg(θ)
1
The K-Statistic of Kleibergen (2005) also permits general non-linear moment func-
tions, however the proof of asymptotic validity does not adequately account for singular
variance in the transformed moment function considered. This issue is beyond the scope
of this chapter, however the author intends to work on this in future research
2
Identification Robust Inference with Singular Variance
Under a set of assumptions including the asymptotic moment variance Ω :=
E[g(wi, θ0)g(wi, θ0) ] is non-singular ˆTGAR(θ0) converges in distribution to χ2
m
(e.g Stock and Wright (2000)). The majority of the literature on identifica-
tion robust inference makes no explicit assumption of first order identifi-
cation. Namely that G := E[Gi(θ0)] is full column rank where Gi(θ) :=
∂g(wi, θ)/∂θ .
The impetus for this chapter stems from the fact that Ω must be singular
when G is not full rank for a class of non-linear moment functions, including
single equation Non-Linear Least Squares and Maximum Likelihood. This
result has mainly gone unmentioned in the identification robust literature. In
light of this issue current results in the identification robust literature justify
valid inference for a restricted class of identification failure, limited largely
to linear models.
An exception is (Cheng (2008), Andrews and Cheng (2012)) who note the
link between identification failure and singular variance for a particular form
of identification failure in semi-linear regression models. Cheng (2008) de-
rives the limit distribution of the Non-Linear Least Squares (NLS) estimator
for such models. Using this result the distribution of the t, Wald and Quasi-
Likelihood Ratio (QLR) statistic are evaluated and methods of identification
robust inference are proposed based on such statistics.
All three papers overcome the issue of singularity of Ω arising from identifi-
cation failure for asymptotic analysis by an assumption that the form of the
singularity is known up to a matrix of model parameters. The class of iden-
tification failure (and hence singular variance) that satisfy this assumption is
shown to be restrictive, being difficult to motivate outside of the particular
examples of identification failure studied in both papers.
This chapter differs from Andrews and Cheng (2012) in two ways. (i) Condi-
tions which the GAR statistic is asymptotically χ2
m are provided for general
forms of identification failure requiring no assumptions on the form of mo-
ment singularity. (ii) To achieve (i) the GAR statistic is expanded around θ0
via second order asymptotic expansions of the eigensystem of ˆΩ(θ)−1
. This
method is of interest in its own right and would prove useful extending results
for other identification robust statistics and estimators to allow for general
3
Chapter 1
forms of identification failure.
Second order asymptotic expansions of the eigenvectors of ˆΩ(θ) around θ0 are
derived borrowing results from Matrix Perturbation Theory with its roots in
Kato (1982). This field has not readily made it in to the mainstream econo-
metric literature- exceptions being Ratsimalahelo (2002) who consider tests
of matrix rank, Moon and Weidner (2010) derive expansions of the Quasi
Maximum Likelihood profile function for panel data models and Hassani et
al (2011) use such expansions for Singular Spectral Analysis.
Utilizing this result general second order eigenvalue expansions of ˆΩ(θ) around
θ0 are established. Specific expansions under an i.i.d assumption (along with
requisite regularity conditions) are then derived. These eigensystem expan-
sions will prove useful when extending the results of this chapter to non-i.i.d
settings and are new in the identification literature.
In order for the result (i) to hold further conditions on Gi(θ) and ˆΩ(θ) at θ0
are required when considering general forms of identification failure. A key
condition requires those δ ∈ Rm
such that δ Ω = 0 imply δ G = 0 (i.e the
null space of Ω is a subset of that of GG ). For example this rules out singu-
lar variance when the strong identification conditions hold in just-identified
models. In this case the GAR statistic is shown to be bounded in probabil-
ity of order n. This issue currently unknown in the literature is termed the
‘moment-singularity bias’.
Simulation evidence demonstrates this bias in a Linear IV Simultaneous
Equation setup. The small sample approximation of the GAR statistic by a
χ2
m distribution is shown to be poor when the null space of Ω almost does
not lie within that of GG (i.e when δ Ω ≈ 0 and δ G = 0). In this case the
GAR statistic is shown to be oversized even for large sample sizes.
Numerous examples of singular variance for commonly used moment func-
tions are provided including financial econometric models and Non-Linear IV
Simultaneous Equations. Many cases where the assumption on the form of
the singularity in Andrews and Cheng (2012) is violated are provided.
Section 1.2 explores the relationship between G and Ω for conditional moment
restrictions. Section 1.3 sets out the asymptotic approach, deriving second or-
der asymptotic expansions of the eigensystem of ˆΩ(θ) and specific expansions
4
Identification Robust Inference with Singular Variance
in the case wi is i.i.d. Section 1.4 provides conditions under which the GAR
statistic is asymptotically locally χ2
m and explains the ‘moment-singularity
bias’. An extensive simulation study is also provided demonstrating the
main results of this chapter. Section 1.5 presents conclusions and directions
for further research. An Appendix collects proofs of the main theorems.
1.2 Identification and Singular Variance
The link between identification failure and singular variance is not a new idea
in the identification robust literature. Andrews and Cheng (2012) provide
asymptotic results under the assumption that there exists B(θ),
B(θ) = diag(Im∗×m∗ , ι(θ)I¯m× ¯m) (1.1)
Where m∗
= Rank(Ω) and ¯m = m − m∗
, ι(θ) = ||θ|| such that,
B(θn)−1 ˆΩ(θn)B(θn)−1 p
→ ¯Ω (1.2)
For all θn = θ0+∆n where ||∆n|| > 0 and ||∆n|| = op(n−1/2
) and Rank(¯Ω) = m.
They derive asymptotic properties of (functions) of general extremum estima-
tors working with the transformed moment function B(θn)−1
√
nˆg(θn) where
asymptotic singularity of
√
nˆg(θn) is eradicated. Once the moment function
is transformed the limit variance is non-singular and standard asymptotic
analysis is feasible. The existence of such a matrix B(θ) satisfying (1.2) is
restrictive, being difficult to motivate generally outside of piecewise linear
models with particular forms of identification failure, see Section 1.2.2.
Section 1.2.1 studies the relationship between G and Ω more generally from
moment conditions derived from a system of conditional moment restrictions.
Conditions under which Null(Ω) ⊆ Null(GG ) are derived for general non-
linear models with arbitrary forms of identification failure. As demonstrated
in Section 1.4.2 this condition turns out to be crucial for ˆTGAR(θn) to be
bounded in probability with a χ2
m limit distribution. Empirically relevant
examples are given where this condition does not hold in Section 1.2.2.
5
Chapter 1
1.2.1 Conditional Moments
Consider a J×1 residual function ρ(θ) := ρ(x, θ) where ρ(·, ·) : X×Θ −→ RJ
,
x ∈ X ⊆ Rl
with a h × 1 instrument z satisfying,
E[ρ(θ)|z] = 0 at θ = θ0 (1.3)
Broadly speaking there are two types of moment function derived from (1.3)
depending upon whether E[∂ρ(θ)/∂θ |z] must be estimated beforehand.
Namely whether (i) E[∂ρ(θ)/∂θ |z] = ∂ρ(θ)/∂θ a.s(z) for example Non-
Linear Least Squares (NLS) and Unconditional Maximum Likelihood (MLE)
where x = z. For ML ρ(θ) is the likelihood function and if the moemnt
moment condition used to form the GAR is from the score E[∂ρ(θ)/∂θ ] = 0
at θ = θ0 when MLE is correctly specified case the variance equals the Fisher
Information Matrix. The issue would also arise in Quasi-Maximum Likeli-
hood estimation if V ar(∂ρ(θ)/∂θ ) is singular at the pseudo-true parameter
θ = θ∗
. The derivation of the distribution of the (Q)MLE statistic evaluated
at points near to θ∗
is beyond the scope of this paper.
Alternatively case (ii) E[∂ρ(θ)/∂θ |z] = ∂ρ(θ)/∂θ for z with measure greater
than zero, for example non-linear instrumental variables where generally
z = x.
Case (i): E[∂ρ(θ)/∂θ |z] = ∂ρ(θ)/∂θ
Define D(θ, z) := E[∂ρ(θ)/∂θ |z] and Ωρ(θ, z) = E[ρ(θ)ρ(θ) |z]. In the i.i.d
setting the optimal instrument is D(θ0, z)Ωρ(θ0, z)−1
, Newey (1993).
Take the case J = 1 forming the moment g(θ) = D(θ, z)ρ(θ),
Ω = E[ρ(θ0)2
D(θ0, z)D(θ0, z) ]
G = E[D(θ0, z)D(θ0, z) ]
Hence for any δ ∈ Rp
such that δ Ωδ = 0 implies
E[ρ(θ0)2
(δ D(θ0, z))2
] = 0
6
Identification Robust Inference with Singular Variance
δ D(θ0, z) = 0 a.s(z). Therefore δ Gδ = E[(δ D(θ0, z))2
] = 0. The reverse is
also simple to establish, so that Null(Ω) ≡ Null(G) ≡ Null(GG ). First order
under-identification and singular variance are equivalent for single equation
NLS. This result may break down for J ≥ 2 if Ωρ(θ0, z) is singular a.s(z)
existing cases where the null space of GG and Ω are not equivalent2
.
Proposition 1: For g(θ) = D(θ, z)ρ(θ)
Null(Ω) ⊆ Null(GG ) iff δ ∈ Rp
such that D(θ0, z) δ ∈ Null(Ωρ(θ0), z)/0
a.s(z)
Proof For δ = 0, δ Ωδ = E[δ D(θ0, z)Ωρ(θ0, z)D(θ0, z) δ] = 0 iff ∃δ ∈ Rp
such that D(θ0, z) δ lies in the null space of Ωρ(θ0, z) a.s(z) since δ G = 0 iff
δ D(θ0, z) = 0 a.s(z).
Q.E.D
Case (ii) E[∂ρ(θ)/∂θ |z] = ∂ρ(θ)/∂θ
Commonly when D(θ0, z) is not known a priori the fact that (1.3) implies
the following moment condition for any m × 1 Z := (φ1(z), .., φm(z)) where
{φj(.) : j = {1, .., m}} are arbitrary functions of z (e.g polynomials in z up
to order m),
E[ρ(θ) ⊗ Z] = 0 at θ = θ0
For example the Consumption Capital Asset Pricing Model moment condi-
tions in Stock and Wright (2000). In this case
G = E[D(θ0, z) ⊗ Z]
Ω = E[Ωρ(θ0, z) ⊗ ZZ ]
Where G is an mJ × p matrix and Ω is mJ × mJ. In this case in general the
null space of Ω and GG are not necessarily linked. Given that Z includes
2
Note a similar result can also be shown based utilizing an estimate of the optimal
instrument based on an a consistent estimator of a generalized inverse Ωρ(θ0, z)−
noting
that the Rank(Ωρ(θ0, z)) = Rank(Ωρ(θ0, z)−
).
7
Chapter 1
no linearly redundant combinations of instruments then Ω may be less than
full rank only when Ωρ(θ0, z) is not full rank a.s(z). Define δ := (δ1, .., δJ )
where δj ∈ Rm
for j = {1, .., m}.
Proposition 2: Null(Ω) ⊆ Null(GG ) iff δ ∈ RmJ
where(δ1Z, .., δJ Z) ∈
Null(Ωρ(θ0, z)) a.s(z) such that δ ∝ ν for some ν ∈ RmJ
s.t ν G = 0
Proof: For δ = 0 then δ Ωδ = E[(δ1Z, .., δJ Z)Ωρ(θ0, z)(δ1Z, .., δJ Z) ] hence
δ Ω = 0 iff (δ1Z, .., δJ Z) ∈ Null(Ωρ(θ0, z)) a.s(z). The null space of Ω will
not lie in that of GG iff ∃δ ∈ RmJ
such that (δ1Z, .., δJ Z) ∈ Null(Ωρ(θ0, z))
where δ ∝ ν for some ν ∈ RmJ
where ν G = 0.
Q.E.D
Remarks:
(i) When Ωρ(θ0, z) is homoscedastic (i.e Ωρ(θ0, z) = Ωρ a.s(z) for some
p.s.d symmetric m × m matrix Ωρ) then it is straightforward to show that
Rank(Ω) = mr where r = Rank(Ωρ).
(ii)If for any function a(.) of z ∃ π ∈ Rm
such that
E[(π Z − a(z))2
] → 0 (1.4)
For m → ∞ then Rank(Ω) ≤ mJ − r∗
(as m → ∞) where r∗
= J −
Rank(Ωρ(θ0, z)) a.s(z). Since by (1.4) there will exist at least r∗
linearly
independent vectors δ ∈ RmJ
s.t (δ1Z, .., δJ Z) can be expressed as some
linear combination of elements of the null space of Ω(θ0, z) a.s(z) for m
large.
Especially a concern is (ii) as even if ρ(θ0) has no perfectly correlated (linear
combination of) elements (E[Ωρ(z, θ0)] is full rank), Ω will be singular for m
large when there exists perfect conditional correlation in elements of ρ(θ0)
(i.e r∗
> 0). This would violate the condition for GAR to be asymptotically
χ2
m. An example of this case is provided Example 3 in Section 1.2.2 with a
corresponding simulation provided in Section 1.4.2.
8
Identification Robust Inference with Singular Variance
1.2.2 Examples of Singular Variance
This sections provides examples of moment functions with singular variance
both with and without identification- specifically when the condition that
Null(Ω) ⊆ Null(GG ) holds or does not.
Singular Variance : Null(Ω) ⊆ Null(GG )
A class of identification failure satisfying Null(Ω) ⊆ Null(GG ) is the stochas-
tic semi-linear parametric equations (for J = 1) considered in Cheng (2008)3
.
y = α x + πf(z, γ) +
Where θ = (α, γ, π), α ∈ Rq
, π ∈ R, γ ∈ Rl
and f(·, ·) : Rd
× Rl
→ R is a
continuously differentiable function.
Let w = (y, x, z) where y is a scalar random variable, x is q ×1 and z is d×1
where E[ |x, z] = 0 at θ = θ0 for some parameter vector θ0 = (α0, γ0, π0).
Define f(γ) := f(z, γ), (θ) := y − α x − πf(γ),
∂ (θ)
∂θ
= (x, f(γ), π∂f(γ)/∂γ)
Then the moment function utilized in NLS is
g(θ) = (θ)(x, f(γ), π∂f(γ)/∂γ)
Under the i.i.d assumption the variance of the moments at any θ ∈ Θ is
Ω(θ) = E (θ)2



xx f(γ)x πx∂f(γ)/∂γ
f(γ)x πf(γ)2
πf(γ)∂f(γ)/∂γ
π∂f(γ)/∂γx πf(γ)∂f(γ)/∂γ π2
∂f(γ)/∂γ∂f(γ)/∂γ



Ω would be singular in the following three cases (and potentially others),
3
Cheng (2008) allow for a vector of non-linear functions though for simplicity this
special case is highlight to demonstrate the infeasibility of the assumption on the form of
the singular variance made in both Cheng (2008) and Andrews and Cheng (2012).
9
Chapter 1
(i) θ0 = (α, γ, 0) for any (α, γ) ∈ Rq+l
.
(ii) f(γ0) = δ x for some δ ∈ Rq
.
(iii) δ1∂f(γ0)/∂γ = δ2x for some δ1 ∈ Rl
and δ2 ∈ Rq
. where ||δ1|| > 0.
Case (i) falls under the assumption of Andrews and Cheng (2012). Namely
for the matrix B(θ) =
I2×2 02
02 π
then B(θ)−1
Ω(θ)B(θ)−1
is no longer a
function of π. In this case singularity cased by π0 = 0 is removed. However
there exist no matrix of the form B(θ) that will remove the singularity for
cases (ii) and (iii) and more generally for arbitrary forms of singularity that
depend upon the Data Generating Process.
Example 1: Heckman Selection Consider a Heckman Selection Re-
gression where f(z, γ) = φ(z γ)/Φ(−z γ) is the Inverse Mills Ratio and z
corresponds to variables which govern sample selection. If z γ0 = c for some
constant c and x includes a constant then singularity arises from (ii). Even
if this condition does not hold, as noted by Puhani (2000) and others the
Inverse Mills Ratio is approximately linear for a wide range of γ. In this case
if x and z contain coinciding variables then NLS would be weakly identified
with almost singular variance.
Example 2 provides a case of a general non-linear moment function where
Null(Ω) ⊆ Null(GG ). Also note that in this case there exists no matrix B(θ)
satisfying (1.2).
Example 2: Interest Rate Dynamics
r − r−1 = a(b − r−1) + σrγ
Where r−1 is the first lag of the interest rate r. Define θ = (a, b, σ, γ). Under
the assumption that is stationary at θ = θ0 where θ0 = (a0, b0, σ0, γ0)
then using the test-function approach of Hansen and Scheinkman (1995) the
10
Identification Robust Inference with Singular Variance
following moment function is derived in Jagannathan and Wang (2002),
g(θ) =






a(b − r)r−2γ
− γσ2
r−1
a(b − r)r−2γ+1
− (γ − 1
2
)σ2
(b − r)r−a
− 1
2
σ2
r2γ−a−1
a(b − r)r−σ
− 1
2
σ3
r2γ−σ−1






satisfying E[g(θ)] = 0 at θ = θ0.
When σ0 = a0, γ0 = 1/2(a0 + 1) or γ0 = 1/2(σ0 + 1) redundant moments
exist at the true parameter. For example if all three conditions held simulta-
neously the rank of Ω be 1 as there would exist only one linearly independent
comibination in g(θ).
Singular Variance: Null(Ω) Null(GG )
Common causes of singular variance arise from a lack of identification. It is
however plausible that singular variance occurs where Null(Ω) Null(GG ),
for example in just-identified settings when G is full rank (first-order identi-
fied) though Ω is singular.
Example 3: IV Simultaneous Equations
Consider an example of a conditional moment restriction where J = 2,
ρ1(θ0) = h1(z)
ρ2(θ0) = h2(z)
Where E[ 2
|z] = 1 and h1(z) and h2(z) are the conditional heteroscedasticity
for equations 1 and 2 respectively. Let Z be an m × 1 vector function of z
used as instruments.
Let δ = (δ1, δ2) where δ1, δ2 ∈ Rm
then
δ Ωδ = E[h1(z)(δ1Z)2
] + E[h2(z)(δ2Z)2
] + 2E[ h1(z)h2(z)δ1Zδ2Z]
For example if δ1Z = 1/ h1(z), δ2Z = −1/ h2(z) then Ω is singular. In
the case where h1(z) = h2(z) then any δ1, δ2 ∈ Rm
where δ1Z = −δ2Z would
11
Chapter 1
yield δ Ωδ = 0. This is an example of Proposition 2 and in general δ Ω = 0
does not imply δ G = 0. Take for example
ρ1(θ) = y1 − θ1x1
ρ2(θ) = y2 − θ2x2
Where θ = (θ1, θ2), x = (y1, y2, x1, x2) with instrument vector Z = (1, z).
Assuming E[x1|z] = ¯π(1 + z), E[x2|z] = −¯π(1 + z2
) and z ∼ N(0, 1) it is
straightforward to establish,
G =
¯π(1, 1) 02
02 ¯π(−2, 0)
If h1(z) = h2(z) then δ1 = (c, 0), δ2 = (−c, 0) for c = 0 imply δ Ω = 0
however δ G = (c¯π, 2c¯π) = 0 when ¯π = 0. Note that if instruments were
irrelevant (¯π = 0) then δ G = 0 for all directions δ ∈ R4
.
Though the example here is somewhat pathological (requiring ρ1(θ0), ρ2(θ0)
be perfectly correlated) the problem extends also to the case where no equa-
tions are perfectly correlated, i.e h1(z) = h2(z)).
For example if h1(z) = exp(−ζ1z) and h2(z) = exp(−ζ2z) (where ζ1 = ζ2) if
Z includes polynomial orders of z up to m then δ1 and δ2 such that δ1z =
1+1/2ζ1z +...+(1/2ζ1z)m
/m! and δ2z = −(1+1/2ζ2z +...+(1/2ζ2z)m
/m!)
will well approximate 1/ h1(z) and −1/ h2(z) respectively for m large.
When using many instruments (and/or with J large) it is entirely plausible
there exist directions in which δ Ω = 0 that do not imply δ G=0.
1.3 Matrix Perturbation Theory
Section 1.4 derives conditions under which ˆTGAR(θn) converges in distribution
to a χ2
m limit for any local sequence θn = θ0 + ∆n−δ
( hence ∆n = ∆n−δ
)
where ∆j = 0 ∀j = {1, .., p}. This is the asymptotic approach in Bottai
(2003) and others. This could be generalised to allow ∆ to be a potentially
random variable such that n−δ
∆n
d
→ ∆ and allow for different rates of con-
12
Identification Robust Inference with Singular Variance
vergence. This would however not change or add to the fundamental result
in Theorem 1. Crucially we model each parameter as perturbed away from
its true value, where the perturbation may be made infinitesimally though
never zero (i.e any finite δ > 0 and ∆j = 0 for all j = {1, .., p}). For ex-
ample if one parameter in θ0 leads to singularity, then if parameter is not
perturbed the matrix will be singular irrespective of perturbations to the
remaining parameters. And again, if we can establish that the inverting the
GAR statistic with a χ2
m covers an infinitesimally small region around all
parameters this ensures it covers θ0. This method is used to establish local
uniform coverage in other papers as mentioned for example by Bottai (2003).
Even if θ0 were not a point of singularity we would still wish to establish that
this local coverage condition and is not an assumption made to deal with po-
tential singularity. Nor is it an assumption the true parameter is a drifting
sequence- though it could be interpreted this way if desired without loss of
generality. The large simple distribution of the GAR is derived without an
assumption the form of the singularity is known. To do so the GAR statistic
at θn is expanded around the point of singularity θ0, requiring second order
expansions of the eigensystem of ˆΩ(θn) around θ0. This section is concerned
with deriving these expansions.
Firstly definitions for the eigensystem of the functional matrix Ω(θ) and ˆΩ(θ)
are outlined. By construction both matrices are p.s.d and symmetric hence
the following decompositions can be made for all θ ∈ Θ. Let the m×m matrix
P(θ) be the matrix of population eigenvalues where Ω(θ) = P(θ)Λ(θ)P(θ)
Such that P(θ) P(θ) = Im and Λ(θ) contains the eigenvalues of Ω across
the diagonal and zeros on the off-diagonal. Define the rank of Ω(θ) as
m − ¯m(θ) where 0 ≤ m(θ) ≤ m. Express P(θ) = (P+(θ), P0(θ)) and
Λ(θ) =
Λ+(θ) 0
0 Λ0(θ)
where Λ+(θ) is an (m− ¯m(θ))×(m− ¯m(θ)) diag-
onal matrix with the non-zero eigenvalues of Ω(θ) on the diagonal with cor-
responding eigenvector matrix P+(θ). Λ0(θ) = 0¯m(θ)× ¯m(θ) with corresponding
eigenvector matrix P0(θ). Performing an eigenvalue decomposition re-write
13
Chapter 1
Ω(θ) as
Ω(θ) = P+(θ)Λ+(θ)P+(θ) + P0(θ)Λ0(θ)P0(θ)
Performing a similar decomposition for ˆΩ(θ)
ˆΩ(θ) = ˆP+(θ)ˆΛ+(θ) ˆP+(θ) + ˆP0(θ)ˆΛ0(θ) ˆP0(θ)
Where ˆP+(θ) is an (m − ¯m(θ)) × (m − ¯m(θ)) matrix of sample eigenvector
estimates of P+(θ) with corresponding sample eigenvalue ˆΛ+(θ). ˆP0(θ) and
ˆΛ0(θ) are similarly the sample estimates of P0(θ) and Λ0(θ) respectively let-
ting ˆP(θ) := ( ˆP+(θ), ˆP0(θ)).
Define Ω = Ω(θ0) and ˆΩ = ˆΩ(θ0) and ¯m(θ0) := ¯m for notational simplic-
ity throughout and let the eigenvalues/vector matrices of both Ω and ˆΩ be
defined without θ0, for example P := P(θ0), ˆP := ˆP(θ0) and so on.
1.3.1 Asymptotic Eigensystem Expansions
Borrowing results from the Matrix Perturbation literature second order ex-
pansions of the eigenvectors of ˆΩ(θn) are derived, Hassani et al. (2011).
Using this result second order expansions of the eigenvalues around θ0 are
established. These results for the sample moment variance matrix are new
in the literature and of interest in their own right.
Assumption 1 (A1): General Eigensystem Expansions
(i) c ≤ [Λ+]jj ≤ K for some 0 < c ≤ K < ∞ ∀ j = {1, .., ¯m}, (ii)
||ˆΩ(θ) − ˆΩ(θ∗
)|| ≤ ˆM||θ − θ∗
|| ∀θ, θ∗
∈ Θ for some ˆM = Op(1), (iii) m < ∞
A1(i) is a relatively trivial condition which assumes the non-zero eigenvalues
are well separated from zero and bounded. A2(ii) requires an asymptotic
Lipschitz condition on the sample variance matrix. A3(iii) is an assump-
tion of a finite number of moments which is made for simplicity, all results
could readily be extended to allow m → ∞ with appropriate rate restrictions
relative to n.
14
Identification Robust Inference with Singular Variance
Define Ω+ = P+Λ+P+ and Ω∗
+ = P+Λ−1
+ P+
Theorem 1 (T1): General Eigensystem expansions of
Under A1,A2
ˆP+(θn) = P+ + Op(||ˆΩ − Ω|| ∧ ||∆n||) (1.5)
ˆΛ+(θn) = Λ+ + Op(||ˆΩ − Ω|| ∧ ||∆n||) (1.6)
ˆP0(θn) = P0 − Ω∗
+
ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)2
) (1.7)
ˆΛ0(θn) = P0
ˆΩ(θn)P0 − P0
ˆΩ(θn)Ω∗
+
ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)3
) (1.8)
Second order expansions for the eigenvectors/values corresponding to non-
zero eigenvalues are also provided in Lemma A2. As shown in Section 1.4
second order terms in ˆΛ+(θn), ˆP+(θn) do not enter first order asymptotics
for ˆTGAR(θn) these results are omitted here for brevity. Theorem 2 provides
expansions of the eigensystem of ˆΩ(θn) around θ0 under an i.i.d assumption
on wi with corresponding regularity conditions.
Assumption 2 (A2) : i.i.d Eigensystem Expansions
(i) wi(i = 1, .., n) is an i.i.d sequence, (ii) E[||gi||2
] < ∞,(iii) 1
n
n
i=1 ||Gi(θ)−
Gi(θ∗
)|| ≤ ˆM||θ − θ∗
|| ∀θ, θ∗
∈ Θ where ˆM = Op(1), (iv) E[||Gi||2
] < ∞,
(v)E[gi(θ)] = 0 at θ = θ0.
A2(i) is made largely for simplicity, all results could be extended to allow
for dependence and heteroscedasticity under further regularity conditions.
A2(iii) requires that for n large enough the average of any elements of Gi(θ) is
sufficiently continuous. This is a weaker condition than Gi(θ) is continuous,
though a sufficient condition for A2(ii) is that Gi(·) satisfies the Lipschitz
condition. A2 (ii), (iv) are both required such that the remainder terms in
the eigensystem expansions are bounded.
For any arbitrary sequence ∆n where ||∆n|| > 0, ||∆n|| = op(n−1/2
) define
¯∆n = ||∆n||−1
∆n where ¯∆n
p
→ ∆ where ||∆|| > 0 and is bounded. Define
15
Chapter 1
gi := gi(θ0), Gi := Gi(θ0) and the following4
Γ = P0E[Gi∆∆ Gi]P0,Ψ =
P0E[Gi∆gi], Φ := Γ − ΨΩ∗
+Ψ .
Theorem 2 (T2): i.i.d Eigensystem Expansions
Under A1, A2
ˆP+(θn)
p
→ P+ (1.9)
ˆΛ+(θn)
p
→ Λ+ (1.10)
||∆n||−1
( ˆP0(θn) − P0)
p
→ Ω∗
+Ψ (1.11)
||∆n||−2 ˆΛ0(θn)
p
→ Φ (1.12)
1.4 Generalized Anderson Rubin Statistic with
Singular Variance
This section derives conditions under which the GAR statistic has a χ2
m limit
distribution making no assumption on the form of singularity.
The GAR statistic ˆTGAR(θ) does not exist at θ = θ0 when Ω is singular.
However when Φ is full rank then the GAR statistic exists (w.p.1) since
||∆n||−2 ˆΛ0(θn) = Φ + op(1) by Theorem 2 where (||∆n||−2 ˆΛ0(θn))−1
needs to
exist for TGAR(θn) to exist (w.p.1). 5
.
Assumption 3 (A3) : Limit Distribution of GAR Statistic
(i) Null(Ω) ⊆ Null(GG ), (ii) Φ is p.d.
A3(i) is a crucial condition needed for the GAR statistic to have the stan-
dard χ2
m limit distribution. Note that this assumption always holds for NLS
4
For simplicity w omit dependence of Γ, Ψ on the arbitrary limit ∆.
5
Note that this is not an assumption that the true parameter is a sequence converging
to θ0 at some rate, merely that we are evaluating the distribution of TGAR(θ) at points
arbitrarily close to θ0. Using these results the true parameter could be modeled as some
sequence converging to a limit θ0 which is commonly used to model certain forms of weak-
identification in the literature, for example Stock and Wright (2000), Andrews and Cheng
(2012).
16
Identification Robust Inference with Singular Variance
where J = 1 by the results in Section 1.2.1. When A3(i) is violated the
GAR statistic in general is Op(n) as shown in Theorem 4. This is termed the
‘moment-singularity bias’ in Section 1.4.2.
A3(ii) is required for ˆTGAR(θn) to exist w.p.1 when the function does not exist
at θ0 due to a singularity in Ω. Φ = P0(E[Gi∆∆ Gi]−E[Gi∆gi]Ω∗
+E[gi∆ Gi])P0
is p.s.d and in will not be p.d unless P0gi(θ) = 0 for θ = θn. Note that by
definition of θn is a perturbation to each element of θ0. Singular variance
usually occurs at a point, if Ω(θ0) is singular then Ω(θn) where θn perturbs
every element of θ0. For example in y = β0xγ0
+ then singularity in the mo-
ment variance occurs at β0 = 0. At β0 = 0 the moment variance is singular
for all γ ∈ R. However if we perturb β0 by some small amount the moment
variance is non-singular at this point for any γ. It is difficult to think of
examples where singularity exists at a space within B with volume greater
than zero. Example 1,2 and 3 all have singular variance occurring at a point
θ0 where the variance is non-singular at some perturbation away from θ0.
Theorem 3 (T3): Under A1, A2,A3
ˆTGAR(θn)
d
→ χ2
m (1.13)
Remarks
(i) Note that in the standard case where Ω is assumed to be non-singular, A2
(iii), (iv) and A3 are not made. In this case all that is required to establish
(1.13) is
√
nˆg(θn)
d
→ N(0, Ω) which holds under A2(i),(ii) and that ˆΩ(θ) is
(asymptotically) continuous around θ0 which follows from A1(ii). It is then
straightforward to show that ˆTGAR(θn)
d
→ χ2
m .
(ii) When Ω is singular second order terms in the eigensystem expansions of
ˆΩ(θn)−1
enter first order asymptotics. As such second order terms in
√
nˆg(θn)
impact first order asymptotics, requiring further regularity conditions on the
first order derivative.
(iii) Though theoretically ˆTGAR(θn) is asymptotically χ2
m for θn arbitrarily
close but not equal (element by element ) to θ in practise a regularization
may need to be used. When Ω is singular then ˆΩ(θn) has smallest eigenvalues
17
Chapter 1
of order Op(||∆n||2
). Take ∆ = c/n for c = 0 then for large sample sizes
computationally using numerical software number od order 10−x
for x over a
certain threshold are rounded down to zero. In this case we the GAR statistic
evaluated using numerical software returns a warning that the statistic does
not exist. GAR with a regularised estimate of the variance ˆΩ∗
(θ) which
drops those below some vanishing threshold would overcome this problem if
the threshold was selected large enough to not encounter the precision error.
Regularised GAR Staristic (rGAR)
In order to overcome the practical issue of the GAR statistic not existing
due to rounding imprecision of infinitesimally small numbers a regularised
statistic may be preferred in practise. Define J = {1, .., ¯m} i.e those j such
that λj > 0 and Jc the compliment of J. Given |∆n|| = op(n−1/2
) then
by Theorem 2 nˆλj(θn) = op(1) for j ∈ Jc and nˆλj(θn)
p
→ ∞ for j ∈ J.
Then we can estimate J w.p.1 since Pr{nˆλj(θn) > K} → 1 for all j ∈ J
where Pr{nˆλj(θn) > K} → 1 for j ∈ J. In practise the practitioner sets the
threshold K and estimate J by ˆJ := {j ∈ (1, .., m) : nˆλj(θn) > K} then it is
straightforward to show by the results above that Pr{ˆJ = J}. We can then
regularize the variance matrix ˆΩ∗
(θn) = |ˆJ|
j=1
ˆPj(θn) ˆPj(θn) ˆλj(θn) where |ˆJ|
is the dimension of ˆJ which is our estimate of ¯m, the rank of Ω based on the
regularisation.
T∗
GAR(θn) = nˆg(θn) ˆΩ∗
(θn)ˆg(θn) (1.14)
Where n1/2
ˆg(θn)
d
→ N(0, Ω) under A1,A2 and given Pr{ˆJ = J} → 1 then
ˆΩ∗
(θn) = ¯m
j=1
ˆPj(θn) ˆPj(θn) ˆλj(θn) w.p.1 hence ˆΩ∗
(θn) is full rank w.p.1.
since ˆλj > 0 for all j ∈ J. Then ˆΩ∗
(θn)
p
→ Ω and nˆg(θn) ˆΩ∗−1
(θn)ˆg(θn)
p
→ χ2
¯m.
Then inference can be based on rGAR where now the limit distribution is
χ2
¯m not χ2
m. Theorem 3 is confirmed in a simulation based on the Heckman
Selection example in Section 1.4.1. In this case the crucial assumption that
Null(Ω) ⊆ Null(GG ) holds as this is NLS with J = 1. The GAR statistic
when Ω is near singular returns NaN (not a number) some of the time in light
18
Identification Robust Inference with Singular Variance
of the imprecision of numerical software. As such inference is also provided
based on the rGAR where it is confirmed that T∗
GAR(θn) is asymptotically
χ2
¯m where ¯m may be estimated w.p.1 as n tends towards infinite using |ˆJ|.
1.4.1 Simulation : Heckman Selection
Consider the setup in Example 1 where
y = θ1 + θ2x + θ3
φ(θ4 + θ5x)
Φ(−(θ4 + θ5x))
+
Where (x, e) are i.i.d and x ∼ N(0, 8)2
and |x ∼ N(0, 1).
Setting θ0 = (1, 1, 0.2, 0.1, κ) where θn = (1, 1, 0.2, 0.1, κ + 1/n) for κ =
{0, 0.5, 1}, N = {100, 500, 1000, 5000, 50000}.
For κ close to zero NLS is poorly identified as the Inverse Mills Ratio is
approximately linear for arguments less than 2, Puhani (2000). Rejection
probabilities for the event that the GAR function at θn is less than the 90%
quantile of a χ2
5 based on R = 10000 simulations are calculated. In brackets
the percentage of warnings returned (no number reported indicates no warn-
ings returned) in R when calculating the inverse of the variance matrix are
reported. When κ = 0 and hence moments are completely unidentified with
exactly singular variance then though GAR exists at θ0 + 1/n , though the
smallest eigenvalue is Op(|∆n||2
) and due rounding approximations used by
computational software may yield zero eigenvalues in practise. The rejection
frequency depends crucially on the variation in x. If x has small variability
then Ω is close to rank 3 and at κ = 0 when perturbed by 1/n then the
GAR statistic evaluated in numerical software returns NaN the majority of
the time. This is because in this example the rank of Ω is very close to 3
given φ(θ4+θ5x)
Φ(−(θ4+θ5x))
≈ θ4 + θ5x unless x has high variability. Using R software
using only double bit precision the GAR statistic returns a warning message
quite a high proportion of times. This example considered x with a high
variation, where at κ = 0 then Ω is rank 4 and a NaN is still encountered at
a high frequency for n large as evidenced in Table 1.1 below. To combat this
a regularisation would be required in practise. Table 2 provides inference
19
Chapter 1
based on the rGAR statistic with K = 0.001. This negates the NaN issue
and coverage is still asymptotically correct. In practise, especially with high
dimensional moment problems many zero or almost zero eigenvalues in Ω a
regularisation will be necessary.
Table 1.1: GAR Rejection Probabilities: Heckman Selection
κ = 0 κ = 0.5 κ = 1
n = 100 0.088 0.074 0.071
n = 500 0.087 0.092 0.093
n = 1000 0.09 0.096 0.095
n = 5000 0.088(2%) 0.098 0.095
n = 50000 0.092(41.3%) 0.92 0.96
As seen in Table 1.1 for n large then θn is close the point of singularity and
for n large enough the NaN warning will be returned with increasingly high
frequency. One method to overcome this in smaller dimension problems with
would be to use high precision arithmetic to overcome the issue of rounding.
Table 2 repeats the analysis in Table 1.1 but now based on the rGAR statistic
as outline above.
Table 1.2: rGAR Rejection Probabilities: Heckman Selection
κ = 0.05 κ = 0.5 κ = 1
n = 100 0.088 0.083 0.075
n = 500 0.098 0.094 0.099
n = 1000 0.096 0.097 0.098
n = 5000 0.101 0.096 0.099
n = 50000 0.097 0.095 0.096
20
Identification Robust Inference with Singular Variance
1.4.2 Moment-Singularity Bias when Null(Ω) Null(GG )
A3(i) is critical in the proof of Theorem 3. When this condition is violated-
with examples given in Section 1.2.2 in general ˆTGAR(θn) is unbounded in
probability.
Theorem 4 (T4) : Under A1, A2, A3(ii)
ˆTGAR(θn)/n
p
→ ∆ G P0Φ−1
P0G∆ (1.15)
Where ∆ G P0Φ−1
P0G∆ > 0 since Φ is full rank by A3(ii). Hence the GAR
statistic is Op(n) when A3(ii) is violated. When A3(i) is almost violated
the GAR statistic is shown in the simulation below to be potentially very
oversized even for large sample sizes.
Theorem 4 is particularly striking as it implies there exist cases of correctly
specified moments which strongly identify θ0 where identification robust in-
ference based on the GAR statistic would (asymptotically) yield the empty
set. This would usually regarded as a sign of moment misspecification.
Simulation : Linear IV Simultaneous Equations
Consider Example 3 where
y1 = x1 + 1
y2 = 0.5x2 + 2
x1 = ¯π(1 + z) + η1
x2 = −¯π(1 + z2
) + η2
η1 = υ1 exp(−ζ1z), η2 = υ2 exp(−ζ2z)
υ1 =
1 + ρ
2
ζ1 +
1 − ρ
2
ζ2, υ2 =
1 + ρ
2
ζ1 −
1 − ρ
2
ζ2
21
Chapter 1
(υ1, υ2, 1, 2) |z
i.i.d
∼ N(04, Ξ) Ξ =






1 0 0.3 0
0 1 0.5 0
0.3 0 1 0
0 0.5 0 1






For each ¯π = {0, 0.1, 0.5} (uncorrelated, weak, strong) instruments the fol-
lowing simulation is performed. For instrument sets I1 = {1, z}, I2 =
{1, z, z2
} , I3 = {1, z, z2
, z3
} which respectively yield m = {4, 6, 8} mo-
ments rejection probabilities are formulated for the GAR statistic based
on a the 0.9 quantile of the relevant χ2
m based on 5000 repetitions where
θn = (1, 1/2) + 1/n for z
i.i.d
∼ N(0, 1) n = {100, 500, 1000, 5000, 50000},
ρ = {0.9995, 0.999995, 1} (ζ1, ζ2) = {(0, 0), (0, 0.5), (0, 1)}.
When ¯π = 0 the condition Null(Ω) ⊆ Null(GG ) is automatically satisfied,
in which case the GAR statistic should have a rejection probability around
0.1 for large sample sizes and is verified in Table 1.2. For brevity only the
case ζ1 = ζ2 = 0 is reported, similar results were found for both other cases.
When ¯π = 0 then when Ω is singular in directions G does not vanish the
GAR statistic is in general oversized
(i) When ρ = 1 and ζ1 = ζ2 = 0 then Ω is singular as shown in Example 3
δ Ω = 0 implies δ G = 0 if and only if ¯π = 0. The stronger the instruments
(the larger is ¯π) the more oversized the rejection probability for any m.
(ii) When ρ = 1 and ζ1 = ζ2 then Ω approaches a singular matrix as m
increases. Fixing ζ1 = 0 and let ζ2 equal 0.5 and 1. The larger is ζ2 the less
well that any m polynomials of z can approximate exp(ζ2/2z) (i.e h2(z)−1/2
from notation in Example 3). The GAR rejection probability is decreasing
in ζ2 for any given m, ¯π and increasing in both m and ¯π.
(iii) When ρ < 1 then Ω is full rank, however the closer ρ is to 1 in general
the larger the GAR statistic as ¯π increases. Even for large sample sizes the
rejection probabilities can be very close to 1.
Table 1.3 shows the rejection probabilities for the weak instrument case. As
expected when ρ = 1 and ζ1 = ζ2 = 0 the rejection probabilities converge
to 1 as n increases (since GAR is unbounded in this case for any m). For
ρ = 0.999995 and 0.9995 the rejection probabilities for any n,m are smaller
then when ρ = 1 however still oversized in small samples.
22
Identification Robust Inference with Singular Variance
Table 1.3: GAR Rejection Probabilities ¯π = 0
ρ = 0.9995 ρ = 0.99995 ρ = 1
m = 4 m = 6 m = 8 m = 4 m = 6 m = 8 m = 4 m = 6 m = 8
ζ1=ζ2=0
n = 100 0.099 0.080 0.074 0.090 0.092 0.077 0.099 0.904 0.074
n = 500 0.099 0.099 0.097 0.095 0.093 0.087 0.101 0.094 0.084
n = 1000 0.010 0.102 0.0891 0.097 0.103 0.096 0.098 0.094 0.09
n = 5000 0.098 0.093 0.103 0.093 0.106 0.104 0.101 0.097 0.099
n = 50000 0.010 0.102 0.096 0.098 0.101 0.098 0.102 0.091 0.102
As ζ2 increases then in general the rejection probabilities decrease for any ρ
as for any given m the instrument set less well approximate the null space of
Ω(z, θ0). As m increases the rejection probabilities increase.
This pattern is again observed in Table IV for strong instruments. In this
case the rejection probabilities for any given n, m, ρ, ζ2 is relatively more
oversized in general than when ¯π = 0.1. This corresponds to the fact the
condition Null(Ω) ⊆ Null(GG ) is potentially more strongly violated in this
case.
23
Chapter 1
Table 1.4: GAR Rejection Probabilities ¯π = 0.1
ρ = 0.9995 ρ = 0.999995 ρ = 1
m = 4 m = 6 m = 8 m = 4 m = 6 m = 8 m = 4 m = 6 m = 8
ζ1=ζ2=0
n = 100 0.135 0.123 0.198 0.428 0.38 0.8 0.492 0.421 0.867
n = 500 0.11 0.114 0.132 0.727 0.724 0.998 0.995 0.996 1
n = 1000 0.106 0.1 0.12 0.628 0.6412 0.992 1 1 1
n = 5000 0.091 0.103 0.095 0.251 0.253 0.599 1 1 1
n = 50000 0.092 0.1 0.108 0.117 0.11 0.15 1 1 1
ζ1=0ζ2=0.5
n = 100 0.117 0.118 0.36 0.204 0.292 0.8 0.218 0.329 0.85
n = 500 0.102 0.107 0.267 0.124 0.542 1 0.119 0.954 1
n = 1000 0.106 0.103 0.194 0.105 0.461 1 0.104 1 1
n = 5000 0.105 0.098 0.109 0.106 0.196 0.986 0.094 1 1
n = 50000 0.103 0.105 0.103 0.099 0.107 0.278 0.095 0.676 1
ζ1=0ζ2=1
n = 100 0.080 0.107 0.521 0.089 0.234 0.739 0.076 0.263 0.764
n = 500 0.094 0.099 0.623 0.086 0.247 1 0.094 0.314 1
n = 1000 0.087 0.099 0.42 0.099 0.162 1 0.095 0.199 1
n = 5000 0.098 0.088 0.150 0.093 0.102 0.972 0.102 0.096 1
n = 50000 0.101 0.096 0.095 0.099 0.095 0.230 0.104 0.098 1
24
Identification Robust Inference with Singular Variance
Table 1.5: GAR Rejection Probabilities ¯π = 0.5
ρ = 0.9995 ρ = 0.999995 ρ = 1
m = 4 m = 6 m = 8 m = 4 m = 6 m = 8 m = 4 m = 6 m = 8
ζ1=ζ2=0
n = 100 0.927 0.893 0.999 1 1 1 1 1 1
n = 500 0.495 0.477 0.939 1 1 1 1 1 1
n = 1000 0.317 0.286 0.706 1 1 1 1 1 1
n = 5000 0.145 0.144 0.222 1 1 1 1 1 1
n = 50000 0.104 0.102 0.110 0.530 0.544 0.964 1 1 1
ζ1=0ζ2=0.5
n = 100 0.761 0.895 1 0.988 1 1 0.992 1 1
n = 500 0.292 0.480 1 0.690 1 1 0.713 1 1
n = 1000 0.193 0.283 1 0.404 1 1 0.425 1 1
n = 5000 0.106 0.125 0.642 0.148 1 1 0.148 1 1
n = 50000 0.100 0.106 0.150 0.102 0.340 1 0.107 1 1
ζ1=0ζ2=1
n = 100 0.171 0.707 1 0.200 1 1 0.194 0.996 1
n = 500 0.101 0.277 1 0.097 0.996 1 0.089 0.955 1
n = 1000 0.096 0.182 1 0.091 0.936 1 0.098 0.349 1
n = 5000 0.088 0.109 0.978 0.094 0.306 1 0.092 0.102 1
n = 50000 0.095 0.105 0.220 0.102 0.108 1 0.091 0.102 1
1.5 Conclusion
This chapter studies identification robust inference based on the GAR statis-
tic with general forms of identification failure. As demonstrated the non-
singular variance assumption is inextricably linked to the assumption of first
order identification. This issue has largely been overlooked in the identifica-
tion literature. A notable exception is Andrews and Cheng (2012) who deal
25
Chapter 1
with the singular variance from identification failure under an assumption
the form of singular variance is known up to model parameters.
In order to study properties of the GAR statistic with singular variance
second order expansions of the eigensystem of the moment variance matrix
around the true parameter were derived. This asymptotic approach is new
in the identification literature and will prove useful for extending results for
other identification robust statistics.
Without making any identification assumptions (and hence allowing for gen-
eral forms of singular variance) the GAR statistic is asymptotically χ2
m under
a further set of conditions. Crucially one condition requires the null space
of the moment variance matrix lie within that of the outer product of the
expected first order derivative matrix. When this assumption is violated the
GAR statistic is unbounded. In this case confidence sets based on inverting
the GAR statistic would asymptotically yield the empty set. This result is
unknown in the literature and is termed the ‘moment-singular bias’
Examples of how this condition could be violated are provided. Roughly speaking, this problem can occur when moments are not weakly identified and are perfectly correlated at the true parameter. This chapter models moments as exactly singular; an interesting extension would model moments as weakly singular, namely modelling the smallest eigenvalues as shrinking to zero at some rate, analogous to the weak-instrument methodology for modelling weak identification. Simulation evidence shows that when the condition on the null spaces of Ω and GG' is close to being violated, the GAR statistic is in general oversized.
The majority of the literature on properties of estimators and identification robust inference makes the assumption that moments have non-singular variance or singular variance of known form. This chapter is a first step in providing a platform to extend results in other settings without making a non-singular variance assumption, or assumptions on the form of the singularity as in Andrews & Cheng (2012). Examples include dropping the non-singular variance assumption for identification robust inference from the GEL objective function made in Guggenberger, Ramalho & Smith (2008).
1.6 Appendix
Appendix A1: Auxiliary Lemmas
Lemma A1: w.p.1, ˆΛ0 = 0.

Proof of Lemma A1: P0'ΩP0 = 0 by definition of P0, i.e. E[P0'gigi'P0] = 0. Since P0'gigi'P0 is p.s.d, P0'gi = 0 a.s(z). Hence

P0'ˆΩP0 = (1/n)Σ_{i=1}^n P0'gigi'P0 = 0

so ˆΩP0 = 0 w.p.1. Then ˆP0 = P0H w.p.1 for some full rank ¯m × ¯m matrix H, since ˆΩˆP0 = 0 by definition, and hence ˆΛ0 = 0.

Q.E.D
Lemma A2: Let ˆA and A be two square symmetric matrices of dimension r where Rank(A) = ¯r and ||ˆA − A|| = Op(ε_n) for some bounded non-negative sequence ε_n. Eigen-decompose A = RDR' where R'R = I_{r×r} and RDR' = R+D+R+' + R0D0R0', where D0 = 0_{(r−¯r)×(r−¯r)} and D+ is a full rank diagonal ¯r × ¯r matrix with the non-zero eigenvalues of A on the diagonal, with 0 ≤ ||D+|| ≤ K for K < ∞. Similarly express ˆA = ˆRˆDˆR' = ˆR+ˆD+ˆR+' + ˆR0ˆD0ˆR0'. Define B = ˆA − A. Then

ˆR+ = R+ − R0R0'B'R+D+^{-1} + Op(ε_n²)

ˆR0 = R0 − R+D+^{-1}R+'BR0 + Op(ε_n²)

Proof of Lemma A2: This result follows from equations (8) and (9) in Hassani et al. (2011).

Q.E.D

Also note that by CS, ||R0R0'B'R+D+^{-1}|| ≤ ||R0||² ||R+|| ||D+^{-1}|| ||B|| = Op(ε_n) provided ||D+^{-1}|| = O(1) (which holds in the application below by A1(i)), so that

ˆR+ = R+ + Op(ε_n)

follows from Lemma A2; this is used in the proofs of Theorems 1-4.
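As an illustration only, the following R sketch checks the null-space block of the Lemma A2 expansion numerically on a made-up rank-deficient matrix. Comparing projectors (rather than individual eigenvectors) avoids the rotation indeterminacy noted in the proof of Lemma A1; all objects and numerical values below are hypothetical and not part of the formal argument.

# Numerical sketch of the R0 block of the Lemma A2 expansion (illustrative only)
set.seed(1)
r  <- 4
Q  <- qr.Q(qr(matrix(rnorm(r * r), r)))        # orthonormal eigenvector basis
A  <- Q %*% diag(c(3, 1.5, 0, 0)) %*% t(Q)     # Rank(A) = 2, nullity = 2
Rp <- Q[, 1:2]; R0 <- Q[, 3:4]; Dp <- diag(c(3, 1.5))
eps  <- 1e-3
B    <- eps * crossprod(matrix(rnorm(r * r), r))   # symmetric O(eps) perturbation
Ahat <- A + B
# Projector onto the eigenvectors of Ahat with the two smallest eigenvalues
R0hat    <- eigen(Ahat, symmetric = TRUE)$vectors[, 3:4]
Proj_hat <- R0hat %*% t(R0hat)
# First order approximation from Lemma A2: R0hat ~ R0 - Rp Dp^{-1} Rp' B R0
R0apx    <- R0 - Rp %*% solve(Dp) %*% t(Rp) %*% B %*% R0
Proj_apx <- R0apx %*% t(R0apx)
c(first_order_change = max(abs(Proj_hat - R0 %*% t(R0))),   # O(eps)
  residual_error     = max(abs(Proj_hat - Proj_apx)))       # O(eps^2)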
Lemma A3: Under A1, A2,

||∆n||^{-2} P0'ˆΩ(θn)P0 →p Γ

Proof of Lemma A3:

ˆΩ(θn) = (1/n)Σ_{i=1}^n gi(θn)gi(θn)'

Taylor expanding gi(θn) around θ0,

gi(θn) = gi + Gi(¯θn)∆n    (1.16)

where ¯θn is a vector between θ0 and θn. Define ¯Gi := Gi(¯θn). Then

ˆΩ(θn) = ˆΩ + (1/n)Σ_{i=1}^n ¯Gi∆n∆n'¯Gi' + (1/n)Σ_{i=1}^n gi∆n'¯Gi' + (1/n)Σ_{i=1}^n ¯Gi∆ngi'    (1.17)

By Lemma A1(i), Pr{P0'gi(θ0) = 0} = 1, so that w.p.1

P0'ˆΩ(θn)P0 = (1/n)Σ_{i=1}^n P0'¯Gi∆n∆n'¯Gi'P0    (1.18)
            = (1/n)Σ_{i=1}^n P0'((¯Gi − Gi)∆n∆n'¯Gi' + Gi∆n∆n'(¯Gi − Gi)')P0 + (1/n)Σ_{i=1}^n P0'Gi∆n∆n'Gi'P0

By repeated application of CS,

||(1/n)Σ_{i=1}^n P0'(¯Gi − Gi)∆n∆n'Gi'P0|| ≤ ||∆n||² ||P0||² (1/n)Σ_{i=1}^n ||Gi(¯Gi − Gi)||
                                            ≤ ||∆n||² ||P0||² (1/n)Σ_{i=1}^n ||Gi|| (1/n)Σ_{i=1}^n ||¯Gi − Gi||

By A2(iii), (1/n)Σ_{i=1}^n ||¯Gi − Gi|| = Op(||∆n||), and (1/n)Σ_{i=1}^n ||Gi|| = Op(1) by A2(i),(iv). Since ||P0|| = ¯m < ∞ by A1(iii),

||(1/n)Σ_{i=1}^n P0'(¯Gi − Gi)∆n∆n'Gi'P0|| = Op(||∆n||³)    (1.19)

Similarly it can be shown that ||(1/n)Σ_{i=1}^n P0'(¯Gi − Gi)∆n∆n'¯Gi'P0|| = Op(||∆n||³).

Define ˆΓn = P0'(1/n)Σ_{i=1}^n Gi¯∆n¯∆n'Gi'P0 and Γn = P0'(1/n)Σ_{i=1}^n E[Gi¯∆n¯∆n'Gi']P0. Substituting these two bounds and (1.19) into (1.18) implies

||∆n||^{-2} P0'ˆΩ(θn)P0 = ˆΓn + Op(||∆n||)    (1.20)

It remains to show ˆΓn →p Γ, which establishes the result. Note E[ˆΓn] = Γn and, by application of CS,

||Γn|| ≤ ||¯∆n||² E[||Gi||²] = O(1)    (1.21)

where ¯∆n = O(1) and, by A2(iv), E[||Gi||²] = O(1). Under A2(i), wi (i = 1, .., n) is i.i.d and E[ˆΓn] = Γn → Γ by the CMT (since ¯∆n¯∆n' → ∆∆' and Γn is a continuous function of the bounded sequence ¯∆n). An application of the Khintchine Weak Law of Large Numbers (KWLLN) element by element to ˆΓn gives ˆΓn →p Γ, and by (1.20), noting that ||∆n|| = op(n^{-1/2}), the result is established.

Q.E.D
Lemma A4: Under A1, A2,

||∆n||^{-1} P0'ˆΩ(θn) →p Ψ

Proof of Lemma A4: By Lemma A1(i) and (1.17),

P0'ˆΩ(θn) = P0'(1/n)Σ_{i=1}^n ¯Gi∆ngi' + P0'(1/n)Σ_{i=1}^n ¯Gi∆n∆n'¯Gi'    (1.22)

where ||P0'(1/n)Σ_{i=1}^n ¯Gi∆n∆n'¯Gi'|| = Op(||∆n||²) as shown in the proof of Lemma A3, since ||Γ|| = O(1) by (1.21). Hence

P0'ˆΩ(θn) = P0'(1/n)Σ_{i=1}^n ¯Gi∆ngi' + Op(||∆n||²)    (1.23)

By CS,

||P0'(1/n)Σ_{i=1}^n (¯Gi − Gi)∆ngi'|| ≤ ||P0|| (1/n)Σ_{i=1}^n ||¯Gi − Gi|| ||∆n|| (1/n)Σ_{i=1}^n ||gi||    (1.24)

where (1/n)Σ_{i=1}^n ||gi|| = Op(1) by the KWLLN under A2(i) and A2(ii) (i.e. E[||gi||²] = O(1)), and (1/n)Σ_{i=1}^n ||¯Gi − Gi|| = Op(||∆n||) by A2(iii), so that ||P0'(1/n)Σ_{i=1}^n (¯Gi − Gi)∆ngi'|| = Op(||∆n||²). Define ˆΨn := P0'(1/n)Σ_{i=1}^n Gi¯∆ngi' and Ψn = P0'E[Gi¯∆ngi']. Then by (1.24),

||∆n||^{-1} P0'(1/n)Σ_{i=1}^n Gi∆ngi' = ˆΨn + Op(||∆n||)    (1.25)

Since E[ˆΨn] = Ψn, where Ψn is bounded for all n because by CS

||Ψn|| ≤ ||¯∆n|| E[||Gi||] E[||gi||]    (1.26)

with E[||Gi||] = O(1) and E[||gi||] = O(1) by A2(ii),(iv), the KWLLN gives ˆΨn →p Ψn, where ||¯∆n|| = O(1) and Ψn → Ψ by the CMT, establishing the result.

Q.E.D
Appendix A2: Main Theorems
Proof of Theorem 1: Apply Lemma A2 with

ˆA = ˆΩ(θn), A = Ω, B = ˆΩ(θn) − Ω,

where ||ˆΩ(θn) − Ω|| ≤ ||ˆΩ(θn) − ˆΩ|| + ||ˆΩ − Ω|| by T and ||ˆΩ(θn) − ˆΩ|| = Op(||∆n||) by A1(ii), so that ε_n := ||ˆΩ − Ω|| ∧ ||∆n||, and with R+ = P+, R0 = P0, ˆR+ = ˆP+(θn), ˆR0 = ˆP0(θn) and D+ = Λ+. Note that ||Λ+^{-1}|| ||P0|| ||P+|| ||ˆΩ(θn) − ˆΩ|| = O(1)Op(||∆n||), since m = O(1) by A1(iii), hence ||P0|| = ¯m = O(1) with 0 ≤ ¯m ≤ m and ||P+|| = m − ¯m = O(1), and ||Λ+^{-1}|| = O(1) by A1(i).

Then by Lemma A2,

ˆP+(θn) = P+ + Op(||ˆΩ − Ω|| ∧ ||∆n||)    (1.27)

establishing (1.5). Next,

||ˆΛ(θn) − Λ|| ≤ ||ˆΩ(θn) − Ω||    (1.28)

by Theorem 4.2 of Bosq (2000), where it has been shown that ||ˆΩ(θn) − Ω|| = Op(||ˆΩ − Ω|| ∧ ||∆n||), establishing (1.6).

To show (1.7) and (1.8), again using Lemma A2,

ˆP0(θn) = P0 − Ω*+ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)²)    (1.29)

establishing (1.7). Also,

ˆΛ0(θn) = ˆP0(θn)'ˆΩ(θn)ˆP0(θn)    (1.30)
        = (ˆP0(θn) − P0)'ˆΩ(θn)(ˆP0(θn) − P0) + P0'ˆΩ(θn)(ˆP0(θn) − P0) + (ˆP0(θn) − P0)'ˆΩ(θn)P0 + P0'ˆΩ(θn)P0

where by (1.7), ˆP0(θn) − P0 = −Ω*+ˆΩ(θn)P0 + Op(||∆n||²). Noting that Ω = Ω+ and, by CS, ||Ω*+ˆΩ(θn)P0|| ≤ ||Ω*+|| ||P0|| ||ˆΩ(θn) − ˆΩ(θ0)|| = Op(||∆n||), since ||Ω*+|| = O(1) by A1(i) and P0'ˆΩ(θn) = P0'(ˆΩ(θn) − ˆΩ(θ0)) by Lemma A1(i), it follows that

(ˆP0(θn) − P0)'ˆΩ(θn)(ˆP0(θn) − P0) = P0'ˆΩ(θn)Ω*+ˆΩ(θn)P0 + Op(||∆n||³)    (1.31)

P0'ˆΩ(θn)(ˆP0(θn) − P0) = −P0'ˆΩ(θn)Ω*+ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)³)    (1.32)

Hence plugging (1.31) and (1.32) into (1.30),

ˆΛ0(θn) = P0'ˆΩ(θn)P0 − P0'ˆΩ(θn)Ω*+ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)³)    (1.33)

which establishes (1.8).

Q.E.D
Proof of Theorem 2: By (1.5) and (1.6),

ˆP+ = P+ + Op(||∆n|| ∧ ||ˆΩ − Ω||)    (1.34)

ˆΛ+ = Λ+ + Op(||∆n|| ∧ ||ˆΩ − Ω||)    (1.35)

where ||ˆΩ − Ω|| = Op(n^{-1/2}) by A2(i),(ii) and ||∆n|| = op(n^{-1/2}), establishing (1.9) and (1.10).

By (1.7),

||∆n||^{-1}(ˆP0(θn) − P0) = −||∆n||^{-1}Ω*+ˆΩ(θn)P0 + op(n^{-1/2})    (1.36)

since ||∆n||^{-1}Op((||∆n|| ∧ ||ˆΩ − Ω||)²) = op(n^{-1/2}) as ||∆n|| = op(n^{-1/2}). By the CMT and Lemma A4, ||∆n||^{-1}Ω*+ˆΩ(θn)P0 →p Ω*+Ψ', establishing (1.11).

By (1.8),

||∆n||^{-2}ˆΛ0(θn) = ||∆n||^{-2}P0'ˆΩ(θn)P0 − ||∆n||^{-2}P0'ˆΩ(θn)Ω*+ˆΩ(θn)P0 + op(n^{-1/2})    (1.37)

since ||∆n||^{-2}Op((||∆n|| ∧ ||ˆΩ − Ω||)³) = op(n^{-1/2}). By Lemma A3, ||∆n||^{-2}P0'ˆΩ(θn)P0 →p Γ, and by Lemma A4 and the CMT, ||∆n||^{-2}P0'ˆΩ(θn)Ω*+ˆΩ(θn)P0 →p ΨΩ*+Ψ', establishing (1.12).

Q.E.D
Proof of Theorem 3:

ˆT_GAR(θn) = n(ˆP+(θn)'ˆg(θn))'ˆΛ+(θn)^{-1}ˆP+(θn)'ˆg(θn) + n(ˆP0(θn)'ˆg(θn))'ˆΛ0(θn)^{-1}ˆP0(θn)'ˆg(θn)    (1.38)

Using the expansion of ˆg(θn) around θ0 obtained by summing (1.15) across i,

√n ˆg(θn) = √n ˆg(θ0) + √n ˆG(¯θn)∆n    (1.39)

By repeated application of CS,

||√n (ˆG(¯θn) − ˆG(θ0))∆n|| ≤ √n ||∆n|| (1/n)Σ_{i=1}^n ||¯Gi − Gi|| = Op(n^{1/2}||∆n||²)    (1.40)

by A2(iii), where n^{1/2}||∆n||² = op(n^{-1/2}). Hence

√n ˆg(θn) = √n ˆg(θ0) + √n ˆG(θ0)∆n + op(n^{-1/2})    (1.41)

First it is established that

n(ˆP+(θn)'ˆg(θn))'ˆΛ+(θn)^{-1}ˆP+(θn)'ˆg(θn) = n(P+'ˆg(θ0))'Λ+^{-1}P+'ˆg(θ0) + op(1)    (1.42)

By (1.9), ˆP+(θn) = P+ + op(1), and by (1.41),

ˆP+(θn)'√n ˆg(θn) = P+'(√n ˆg(θ0) + ˆG(θ0)√n ∆n) + op(1)    (1.43)
                  = P+'√n ˆg(θ0) + op(1)    (1.44)

since ||P+'ˆG(θ0)√n ∆n|| ≤ n^{1/2}||P+|| ||ˆG(θ0)|| ||∆n|| = n^{1/2}O(1)Op(1)op(n^{-1/2}) = op(1). Moreover ˆΛ+(θn) = Λ+ + op(1) by (1.10), and under A1(i) Λ+^{-1} exists, so that by the CMT

ˆΛ+(θn)^{-1} = Λ+^{-1} + op(1)    (1.45)

Together with (1.44) this implies (1.42), so that n(ˆP+(θn)'ˆg(θn))'ˆΛ+(θn)^{-1}ˆP+(θn)'ˆg(θn) →d χ²_{m−¯m}, since √n P+'ˆg(θ0) →d N(0, Λ+) by A2(i),(ii) and the Lindeberg-Levy Central Limit Theorem. We now derive the limit distribution of n(ˆP0(θn)'ˆg(θn))'ˆΛ0(θn)^{-1}ˆP0(θn)'ˆg(θn).

Under A1, A2, A3 it can be shown that

||∆n||^{-1}√n ˆP0(θn)'ˆg(θn) = P0'√n (ˆG(θ0) − G)¯∆n − ΨΩ*+√n ˆg(θ0) + op(1)    (1.46)

By (1.11), ||∆n||^{-1}(ˆP0(θn) − P0)' = −ΨΩ*+ + op(1), so

||∆n||^{-1}√n ˆP0(θn)'ˆg(θn) = (−ΨΩ*+ + op(1))√n ˆg(θn) + ||∆n||^{-1}P0'√n ˆg(θn)    (1.47)

where by (1.39), √n ˆg(θn) = √n ˆg(θ0) + op(1), hence (−ΨΩ*+ + op(1))√n ˆg(θn) = −ΨΩ*+√n ˆg(θ0) + op(1). To establish the first part of the right hand side of (1.46) note that

||∆n||^{-1}P0'√n ˆg(θn) = P0'√n (ˆG(θn) − G)¯∆n + op(1)    (1.48)

since by Lemma A1(i), P0'√n ˆg(θ0) = 0 w.p.1, and by A3(i), P0'G = 0. By (1.12),

||∆n||^{-2}ˆΛ0(θn) = Φ + op(1)    (1.49)

where Φ is p.d by A3(ii). By the CMT and (1.49),

(||∆n||^{-2}ˆΛ0(θn))^{-1} = Φ^{-1} + op(1)    (1.50)

Together, (1.46) and (1.50) establish that w.p.a.1

n(ˆP0(θn)'ˆg(θn))'ˆΛ0(θn)^{-1}ˆP0(θn)'ˆg(θn)    (1.51)
  = (P0'√n (ˆG(θ0) − G)¯∆n − ΨΩ*+√n ˆg(θ0))' Φ^{-1} (P0'√n (ˆG(θ0) − G)¯∆n − ΨΩ*+√n ˆg(θ0))

It can now be established that

P0'√n (ˆG(θ0) − G)¯∆n − ΨΩ*+√n ˆg(θ0) →d N(0, Φ)    (1.52)

Define bi = P0'(Gi − G)¯∆n − ΨΩ*+gi and ¯∆n = ∆n/||∆n||. Then

P0'√n (ˆG(θ0) − G)¯∆n − ΨΩ*+√n ˆg(θ0) = (1/√n)Σ_{i=1}^n bi    (1.53)

where E[(1/√n)Σ_{i=1}^n bi] = 0 and

E[(1/n)Σ_{i=1}^n bibi'] = P0'E[Gi¯∆n¯∆n'Gi']P0 − ΨnΩ*+ΩΩ*+Ψn'    (1.54)

by A2(i), that wi is i.i.d. By definition Ψn = P0'E[Gi¯∆ngi'] → Ψ since ¯∆n → ∆ where ||∆|| < ∞ (and likewise Γn := P0'E[Gi¯∆n¯∆n'Gi']P0 → Γ by the CMT), as E[||Gi||²] < ∞ and E[||gi||²] < ∞ by A2(ii),(iv). Hence

E[(1/n)Σ_{i=1}^n bibi'] → Φ    (1.55)

As wi is i.i.d then so is bi, and given (1.55), by the multivariate Lindeberg-Levy Central Limit Theorem (note that technically bi is a function of n, though only through ¯∆n, where ¯∆n = ∆ + o(1), hence we can appeal to this theorem),

P0'√n (ˆG(θ0) − G)¯∆n − ΨΩ*+√n ˆg(θ0) →d N(0, Φ)    (1.56)

Hence (1.51) converges in distribution to χ²_¯m. Since the two terms on the right hand side of (1.38) are asymptotically orthogonal, their sum is asymptotically χ²_m.

Q.E.D
Proof of Theorem 4: Dividing equation (1.51) by n (and noting that P0'G ≠ 0 since A3(i) is violated), it is straightforward to establish that

(ˆP0(θn)'ˆg(θn))'ˆΛ0(θn)^{-1}ˆP0(θn)'ˆg(θn) →p ∆'G'P0Φ^{-1}P0'G∆    (1.57)

since by A2(i),(iv), P0'ˆG(θ0) →p P0'G. As the first term on the right hand side of (1.38) converges to zero in probability when divided by n, it is then straightforward to establish that ˆT_GAR(θn)/n →p ∆'G'P0Φ^{-1}P0'G∆.

Q.E.D
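To make the object studied above concrete, the following R sketch computes the GAR statistic through the eigen-decomposition used in (1.38) for a simple linear IV moment condition gi(θ) = Zi(yi − xiθ). The design, sample size and numerical tolerance are purely illustrative and are not the simulation designs reported in the chapter.

# Minimal sketch of the GAR statistic via the eigen-decomposition in (1.38)
gar_stat <- function(y, x, Z, theta) {
  n  <- length(y)
  g  <- Z * (y - x * theta)                    # n x m matrix with rows g_i(theta)'
  gb <- colMeans(g)                            # sample moment vector
  Om <- crossprod(g) / n                       # sample moment variance
  eg <- eigen(Om, symmetric = TRUE)
  keep <- eg$values > 1e-12 * max(eg$values)   # eigenvalues treated as non-zero
  Pp <- eg$vectors[, keep, drop = FALSE]
  # First term of (1.38); in this full-rank toy design the P0 block is empty
  n * sum((crossprod(Pp, gb))^2 / eg$values[keep])
}
set.seed(2)
n <- 500
Z <- matrix(rnorm(2 * n), n, 2)                # m = 2 instruments
x <- drop(Z %*% c(1, 0.5)) + rnorm(n)
y <- x + rnorm(n)                              # true theta0 = 1
gar_stat(y, x, Z, theta = 1)                   # compare with qchisq(0.95, df = 2)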
Chapter 2
Overcoming The Many Weak
Instrument Problem Using
Normalized Principal
Components
Abstract
Principal Component (PC) techniques are commonly used to improve the small sample properties of the Linear Instrumental Variables (IV) estimator. Carrasco (2012) argues that PC type methods provide a natural ranking of instruments with which to reduce the size of the instrument set. This chapter shows how reducing the size of the instrument set based on PC methods can lead to poor small sample properties of IV estimators. A new approach to ordering instruments, termed ‘Normalized Principal Components’ (NPC), is introduced to overcome this problem. A simulation study shows the favorable small sample properties of IV estimators using NPC methods, rather than PC, to reduce the size of the instrument set. Using NPC, evidence is provided that the IV setup in Angrist & Krueger (1992) may not suffer the weak instrument problem.
Keywords: Many Weak Instrument Bias, Instrument Selection, Principal Components.
2.1 Introduction
The many weak instrument bias for linear IV estimators[1] is now widely recognized and understood in the literature. The small sample (higher order) bias of IV estimators is a function of both the size and the strength of a set of instruments. In general this bias is increasing in the size and decreasing in the strength of a set of instruments[2] (Rothenberg (1984), Staiger & Stock (1997), Stock, Wright & Yogo (2002), Hahn & Hausman (2003), Hahn, Hausman & Kuersteiner (2004), Newey & Smith (2004), Chao & Swanson (2005), Hahn, Hausman & Newey (2008), Newey & Windmeijer (2009)).
A common instrument reduction technique utilized in IV settings is Principal Components. The Principal Components method applied to IV models is now well documented in both theoretical and applied research (Kloek & Mennes (1960), Amemiya (1966), Doran & Schmidt (2006), Winkelreid & Smith (2011), Carrasco (2012), Carrasco & Tchuente (2012)). Doran & Schmidt (2006) consider the small sample properties of dynamic panel GMM estimators. They provide a heuristic argument for why dropping those Principal Components (PCs) that explain the least amount of variation within a set of moments could improve the small sample properties of GMM. When a subset of PCs are (almost) irrelevant, Doran & Schmidt (2006) argue the PC method may be able to reduce the dimension of the moments used for estimation with potentially little loss in efficiency. This logic underlies much of the literature utilizing Principal Components to reduce the size of the moment set to improve small sample properties of GMM type estimators.
Carrasco (2012) and Carrasco & Tchuente (2012) derive higher order Mean Square
Error (MSE) approximations similar to Donald & Newey (2001); though for a po-
tentially infinite number of instruments. Various regularization methods are con-
sidered to invert the sample covariance matrix of the instruments in finite samples.
Principal Components and related regularization techniques are considered in both
papers. When the size of the instrument set is less than the sample size the MSE
approximations in Carrasco (2012) and Carrasco & Tchuente (2012) collapse to a modified version of the MSE approximations in Donald & Newey (2001).

[1] Throughout the chapter when referring to an IV estimator we refer in general to any estimator based on a set of instruments (e.g. Two Stage Least Squares (2SLS), Generalized Method of Moments (GMM), Generalized Empirical Likelihood (GEL)) unless specifically stated otherwise.
[2] For ease of exposition this chapter illustrates ideas based on the one endogenous variable, many exogenous variables IV setup. The ideas generalize naturally to the case of many endogenous regressors.
Irrespective of the size of the instrument set, the underlying premise of Carrasco
(2012) and Carrasco & Tchuente (2012) is the same as much of the other liter-
ature on PC type methods for instrument reduction: namely, that transforming instruments into their PCs and ranking each PC by its variance provides a good ranking of these transformed instruments in terms of their correlation with the endogenous variable.
Though at first seemingly intuitive, the premise of PC methods when applied
to reducing the size of the instrument set can be crucially flawed. PC methods
generally reduce the dimension of the instrument set by keeping those PCs with
the largest variance (i.e those linear combinations of the instruments that explain
most of the variation within the instrument set). This chapter demonstrates how
those PCs that explain most of the variation within the instrument set need not
explain any of the variation in the endogenous variable. In fact it is possible that all of the variation in the endogenous variable explained by the instrument set lies in those PCs with the smallest variance.
However these are exactly the PCs that are dropped using PC methods of instru-
ment reduction. Hence it is entirely plausible that selecting instruments based on
PC methods may lead to poor small sample properties of IV estimators. This sit-
uation could arise even when there exist linear combinations of instruments with a
strong correlation to the endogenous variable. As such an adapted method of in-
strument ordering is derived to overcome this problem. This method of instrument
dimension reduction is termed ‘Normalized Principal Components ’ (NPC).
NPC transforms the instrument set in such a way that these transformed instru-
ments may be ranked by the amount of variation of the endogenous variable that
they explain. To do this NPC normalizes all PCs to have equal variance. NPC then
estimates parameters from a least squares regression of the endogenous variable on
the NPCs. The squares of these estimated parameters are shown (under regularity conditions) to form a consistent ranking of the contribution each corresponding NPC makes to the total variation in the endogenous variable.
This method provides a natural and clear way to order a set of instruments in
terms of their strength. Using this ranking instruments may be selected efficiently
in some way to improve the small sample properties of the IV estimator. For
example an ad-hoc rule could be adopted, e.g selecting all NPCs that have t-
values in the first stage regression above a certain threshold. NPCs also provide a
natural ordering of instruments with which to minimize the MSE approximations
of Donald & Newey (2001). This is useful as Donald & Newey (2001) point out
that the practical use of MSE approximations with large instrument sets is limited
without some a priori ranking of instrument strength.
In order to implement NPC in practice, precise estimates of both the variance matrix of the instruments and the first stage parameters from a regression of the endogenous variable on the NPCs are required. When the sample size is small relative to the number of instruments these estimates may be imprecise, and the NPC ranking of instruments may be poor in this case. However, the PC method of Carrasco (2012) and similar approaches suffer this drawback also, in that PC methods rely upon a precise estimate of the covariance matrix of the instruments. PC methods also suffer the further drawback that ordering PCs by their variances (eigenvalues) may be a poor indicator of their correlation with the endogenous variable even asymptotically.
There exist many classic examples of many weak instrument problems where the sample size far exceeds the number of instruments. For example, the returns to education data of Angrist & Krueger (1992) uses the Vietnam War Lottery dummies as an instrument for education attainment; the sample size is over 25,000 with only 130 instrumental variables. Another example is the Angrist & Krueger (1991) wage-education data using Quarter of Birth as an instrument, with a sample size of over 300,000 and 180 instruments. With such data sets one may expect precise estimates of the covariance matrix of the instruments and the relevant first stage parameters.
It is widely regarded in the literature that the Vietnam War Lottery as an instrument for educational attainment is weak (Bound, Jaeger & Baker (1995), Angrist & Krueger (1995)). However, using the NPC method applied to this set of instruments it is in fact demonstrated that there exist 14 statistically significant (p < 0.1) linear combinations from the total of 130 instruments considered in Angrist & Krueger (1992). As such this IV problem may not suffer the weak instruments problem; the poor small sample properties in Angrist & Krueger (1992) may instead be due to the poor small sample properties of 2SLS with many instruments. Various estimators based on a reduced set of instruments, selected using various criteria based on the NPC ordering, estimate a return to education much lower than the corresponding OLS estimator in Angrist & Krueger (1992). This conforms with our a priori notion of the sign of the bias in wage-education regressions; unlike
2SLS based on all instruments which estimates a return to education much larger
than that of OLS.
A simulation experiment compares the estimation error in forming an IV estimator
based on an ordering of instruments using the NPC ranking relative to that of PC.
Both the NPC and PC method are utilised as a basis with which to minimize
small sample MSE approximations of Donald & Newey (2001). The simulation
study demonstrates the favorable small sample properties of IV based on ranking
PCs by the NPC method as opposed to that of PC. The PC approach to selecting
instruments is shown in some cases to yield IV estimators with extremely poor
small sample properties when the PC ranking of instruments is poor.
Section 2.2 recaps the literature on instrument selection. Sections 2.3 and 2.4 detail the potential problem with PC and show how the NPC method of ranking instruments overcomes it. Section 2.5 presents a small simulation study of the performance of NPC as an ordering of instruments with which to minimize the MSE of Donald & Newey (2001), relative to the PC methods considered in Carrasco (2012). Section 2.6 applies the NPC method of choosing instruments to Angrist & Krueger (1992). Section 2.7 provides concluding remarks. Proofs of Lemmas, along with details on how to practically implement the NPC method described in the chapter (together with R code), are collected into an Appendix.
2.2 Instrument Selection Methods
The poor small sample properties of IV estimators with many (weak) instruments are now well documented in the literature. In light of this problem, a thriving area of research considers methods to select instruments so as to reduce this many (weak) instrument bias (Hall, Rudebusch & Wilcox (1996), Shea (1997), Donald & Newey (2001), Hall & Peixe (2003), Donald, Imbens & Newey (2009), Kuersteiner & Okui (2010), Carrasco (2012)).
The literature on methods to select instruments is now vast. This chapter focuses
mainly on the methods derived in Donald & Newey (2001) and Carrasco (2012);
providing a compact review of the instrument selection techniques from each paper.
Donald & Newey (2001) derive an approximation to the small sample mean squared
error of the linear homoscedastic IV estimator as a function of any given instrument
set.
The MSE type approximations of Donald & Newey (2001) are sketched in Section
2.2.1. Essentially these expansions provide an approximation to the small sample
MSE of various estimators based on a given set of instruments. As such they
provide a natural criterion with which to select instruments to efficiently reduce
the dimension of the instrument set.
2.2.1 MSE Approximations of Donald & Newey (2001)
Donald & Newey (2001) [DN] consider the following linear IV model:

yi = xi'β + εi    (2.1)

xi = f(zi) + ηi    (2.2)

where xi is a p × 1 vector of endogenous/exogenous variables and zi is an m* × 1 vector of variables such that E[ηi|zi] = 0, E[εi²|xi] = σ²_ε and E[ηiεi|zi] = σ_ηε, where σ_ηε is a p × 1 vector and σ²_ε > 0. f(·) is a p × 1 function of the excluded exogenous variables zi. An m × 1 vector of instruments Zi = φ(zi) can be formed, where φ(·) is some m × 1 function of the exogenous variables zi; common examples include polynomials of zi. Then

xi = ΠmZi + εm(Zi) + ηi    (2.3)

where εm(Zi) is a p × 1 vector containing the approximation error for a given instrument set Zi (i.e. εm(Zi) = f(zi) − ΠmZi). The assumption is that as m → ∞ some p × m linear combination Πm of Zi approximates f(zi) with arbitrary precision (i.e. E[||εm(Zi)||²] → 0 as m → ∞ at some rate). Define Z = (Z1, .., Zn)'.
For every instrument set of size m, the asymptotic variance is estimated by the usual estimate of the Semiparametric Lower Bound (SPLB) and the bias is estimated as some function of m and Z. DN derive higher order expansions for 2SLS, Limited Information Maximum Likelihood (LIML) and Bias Corrected 2SLS. For brevity only the MSE approximation for 2SLS is detailed.
For each of the estimators, under certain restrictions on the growth rate of m relative to n, the MSE approximation takes the form

n(ˆβ − β0)(ˆβ − β0)' = σ²_ε H^{-1} + S(m) + r(m)    (2.4)
where S(m) varies across the different estimators, r(m) is asymptotically negligible and σ²_ε H^{-1} is the asymptotic variance (SPLB), with H := E[f(zi)f(zi)']. For 2SLS, under the assumption that m²/n → 0 along with other regularity conditions, DN show that the above approximation holds with

S(m) = H^{-1} ( σ_ηε σ_ηε' m²/n + σ²_ε f'(I − P)f/n ) H^{-1}    (2.5)
where P := Z(Z'Z)^{-1}Z' is the projection matrix of the instruments Z and f := (f(z1), .., f(zn))'. In practice σ²_ε H^{-1} (the asymptotic variance of 2SLS) and S(m) (the higher order bias) are estimated using their sample counterparts; see Donald & Newey (2001) for a discussion. Appendix A in Section 2.9 provides the R code for estimating σ²_ε H^{-1} + S(m).[3]
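The sketch below is a simplified plug-in version of the 2SLS criterion in (2.4)-(2.5) for p = 1. It is not the code reproduced in Appendix 2.9: the plug-in estimates of σ²_ε, σ_ηε, H and f'(I − P)f are deliberately crude, and the function name, arguments and the first stage error variance estimate sig_eta2 are illustrative assumptions.

# Simplified plug-in version of the 2SLS MSE criterion (2.4)-(2.5), p = 1 (sketch only)
dn_criterion <- function(y, x, Z, m, sig_eta2) {
  n    <- length(y)
  Zm   <- Z[, 1:m, drop = FALSE]
  Pm   <- Zm %*% solve(crossprod(Zm), t(Zm))     # projection onto the first m instruments
  xhat <- drop(Pm %*% x)
  beta <- sum(xhat * y) / sum(xhat * x)          # 2SLS using the first m instruments
  eps  <- y - x * beta
  s2e  <- mean(eps^2)                            # plug-in for sigma^2_eps
  sne  <- mean((x - xhat) * eps)                 # plug-in for sigma_eta_eps
  H    <- sum(x * xhat) / n                      # plug-in for H = E[f(z)^2]
  fres <- max(sum(x * (x - xhat)) - (n - m) * sig_eta2, 0)   # plug-in for f'(I - P)f
  s2e / H + (sne^2 * m^2 / n + s2e * fres / n) / H^2         # sigma^2 H^{-1} + S(m)
}
# sig_eta2 can be estimated once from the first stage with the full instrument set,
# e.g. sig_eta2 <- mean(lm(x ~ Z)$residuals^2), and m chosen to minimise the criterion.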
The number of instrument sets over which to choose increases exponentially with m and the optimisation becomes computationally challenging. Optimizing over such a large set of potential instrument combinations may also lead to unstable estimates and give poor second stage estimators. In light of this, DN argue that an a priori notion of which instruments are strongest is required to reduce the dimension of the discrete optimisation problem. Such knowledge is often unavailable and is not in general predicted by theory.
The regularization approach of Carrasco (2012) and Carrasco & Tchuente (2012) provides one such ranking based on Principal Components and related techniques. They also generalize the expansions of DN (2001) for 2SLS and LIML to the case where m > n and possibly infinite.
2.2.2 The Regularization MSE Approach
Carrasco (2012) considers the same linear IV setup as in DN detailed in Section 2.2.1, allowing the case where m may be greater than n and possibly infinite. When faced with an infinite number of instruments (or more generally when m > n) the problem is ill posed since the sample covariance matrix of the instruments is
singular. Carrasco (2012) uses various regularization methods to approximate the sample covariance matrix of the instruments and generalizes the MSE approximations in DN (2001) for 2SLS; Carrasco & Tchuente (2012) provide similar results for the LIML estimator.

[3] Specifically the code allows one to perform the NPC ranking detailed in Section 2.6 and estimates the MSE expansions of DN evaluated as a function of NPCs. It would be easy to modify the code and evaluate the MSE approximations with a different form of instruments.
Carrasco (2012) uses a regularized inverse of the instrument sample covariance matrix. The intuition is highlighted here for the Principal Components and Spectral Cut-off regularizations. Define Kn as the (potentially infinite dimensional) sample covariance matrix of Z, λjn the j'th sample eigenvalue and φjn the corresponding sample eigenvector (<·, ·> is the inner product with respect to the Euclidean norm), such that for any vector r (conformable with the dimension of Kn),

Kn^{-1} r := Σ_{j=1}^∞ (1/λjn) <r, φjn> φjn    (2.6)
In finite samples a truncation of the (infinite dimensional) sample covariance matrix is required to form a problem which is tractable and that yields an asymptotically valid approximation. The different truncations used correspond to the differing methods of regularization. For example the Spectral Cut-off regularization approximates Kn using eigenvectors with eigenvalues greater than some threshold α (using the eigenvectors with the largest eigenvalues first, as these correspond to most of the variation within the instrument set):
(Kn^α)^{-1} r := Σ_{λjn² ≥ α} (1/λjn) <r, φjn> φjn    (2.7)
Kn^α is the truncated approximation to Kn. α is the tuning parameter and, as discussed in Carrasco (2012), can be viewed as the counterpart to the tuning parameter in non-parametric estimation. See Carrasco (2012) for details on regularizing potentially infinite dimensional matrices and for other forms of regularization.
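The following R sketch applies the spectral cut-off regularized inverse in (2.7) in the finite dimensional case m ≤ n. The data, the choice of α and the function name are purely illustrative assumptions.

# Spectral cut-off regularised inverse applied to a vector r (sketch, m <= n case)
sc_inverse_apply <- function(Z, r, alpha) {
  Kn   <- crossprod(scale(Z, scale = FALSE)) / nrow(Z)   # sample covariance of the instruments
  eg   <- eigen(Kn, symmetric = TRUE)
  keep <- eg$values^2 >= alpha                           # keep eigenvalues with lambda_j^2 >= alpha
  V    <- eg$vectors[, keep, drop = FALSE]
  drop(V %*% (crossprod(V, r) / eg$values[keep]))        # sum_j <r, phi_j> phi_j / lambda_j
}
set.seed(3)
Z <- matrix(rnorm(200 * 10), 200, 10)
sc_inverse_apply(Z, r = rnorm(10), alpha = 0.05)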
The PC regularization is directly linked to the SC method: it approximates Kn using the eigenvectors corresponding to the 1/α largest eigenvalues. Then, for α → 0 fast enough relative to n → ∞, Carrasco (2012) shows the linear IV-GMM estimator based on this regularization reaches the SPLB and is consistent and asymptotically normal under certain regularity conditions. Here the tuning parameter α plays the role of m, the number of instruments, in DN. Carrasco (2012) derives the MSE as a function of α for α²n → 0:
n(ˆβ − β0)(ˆβ − β0)' = σ²_ε H^{-1} + S(α) + r(α)    (2.8)
where r(α) is asymptotically negligible and

S(α) = H^{-1} ( σ_ηε σ_ηε' (Σ_j q(α, λj²))²/n + σ²_ε f'(I − P^α)f/n ) H^{-1}    (2.9)
where for Principal Components q(α, λj²) = I(j ≤ 1/α) and P^α is the projection onto the space spanned by the first 1/α eigenvectors of the truncated sample covariance matrix. Note the similarity with DN, where now the tuning parameter is α instead of m.
We consider a special case of Carrasco (2012) where m ≤ n. In this case the MSE approximation in Carrasco (2012) for PC collapses to that of DN (2001), with the MSE approximations in DN evaluated at the PCs as opposed to the original instruments Z. This is the setting considered in the simulation experiment in Section 2.5. Though the Carrasco (2012) MSE approximations are more general in allowing m > n, PCs are ordered by their eigenvalues, which is shown in Section 2.3 to potentially lead to an IV-GMM estimator with poor small sample properties.
Instrument Reduction Techniques
This section discusses the PC method of instrument reduction and highlights
the potentially critical flaw in this technique for instrument selection. The NPC
method is introduced and is demonstrated to overcome the flaw of PC methods.
Conditions under which NPC works well asymptotically are also sketched.
2.3 Principal Components Ranking of Instruments
When faced with an m × 1 set of instruments Zi it is plausible there may exist some linear combinations of these variables that explain a large portion of the variation within Zi. Principal Components is a method that identifies which linear combinations of the variables Zi explain most of the total variation in Zi.

Define Σ := Var(Zi), the unknown population covariance matrix of the instruments. Define Pj as the j'th eigenvector corresponding to the eigenvalue λj (i.e. Pj'ΣPj = λj) for j = {1, .., m}, P := (P1, .., Pm) and Λ as the m × m diagonal matrix with [Λ]jj = λj. Since Σ is symmetric (by construction) and positive definite, Σ may be eigen-decomposed as Σ = PΛP' where P'P = I_{m×m}.
The j'th PC is defined as Z^pc_ij = Pj'Zi. The PC with the largest variance is the one with the largest eigenvalue, since Var(Z^pc_ij) = Pj'Var(Zi)Pj = Pj'ΣPj = λj. Hence the variance of the PC Z^pc_ij (which is a linear combination of Zi using as weights the eigenvector Pj) is λj. See Jolliffe (2002) for a detailed discussion of the method of Principal Components.
The method of PC as applied to instrumental variables is to transform the instru-
ment set using the eigenvectors P and rank this transformed set of instruments by
their corresponding eigenvalues. Namely order PCs by the size of their variance.
This is then used as a basis for dimension reduction of the instrument set. Without
loss of generality, the PCs are ordered such that λ1 ≥ λ2 ≥ ... ≥ λm.
To implement the PC method in practice requires a consistent estimate of Σ with which to estimate P (the matrix of PC weightings) and Λ (the diagonal matrix with the eigenvalues of Σ along the diagonal). Natural estimates can be formed taking the sample variance of Zi as an estimate of Σ, namely ˆΣ := (1/n)Σ_{i=1}^n (Zi − ¯Z)(Zi − ¯Z)' where ¯Z := (1/n)Σ_{i=1}^n Zi.

ˆΣ can be eigen-decomposed as ˆΣ = ˆPˆΛˆP', where ˆP'ˆP = I_{m×m} and ˆΛ is a diagonal matrix with the sample eigenvalues ˆλj, where ˆPj'ˆΣˆPj = ˆλj (j = {1, .., m}), along the diagonal.
When ||ˆΣ − Σ|| = Op(n^{-1/2}) (sufficient conditions being that Zi is i.i.d with E[||gi(β0)||²] < ∞, i = {1, .., n}) it can be shown that ||ˆP − P|| = Op(mn^{-1/2}) and ||ˆΛ − Λ|| = Op(mn^{-1/2}) (e.g. Bosq (2000)). So long as m²/n → 0 (along with some regularity conditions) then ˆPj and ˆλj consistently estimate Pj and λj respectively for j = {1, .., m}.

A common sample PC method then estimates Z^pc_ij by ˆZ^pc_ij = ˆPj'Zi, which has sample variance equal to ˆλj, similar to the population analog above. The sample PCs are then ranked based on the size of the sample eigenvalues (i.e. their sample variances).
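A short R sketch of the sample PC construction and its eigenvalue ranking follows; the data generating process below is made up purely for illustration.

# Sample PCs and the eigenvalue (variance) ranking used by the PC method (illustration)
set.seed(4)
n <- 1000; m <- 5
Z <- matrix(rnorm(n * m), n, m) %*% diag(c(3, 2, 1.5, 1, 0.5))  # instruments with unequal scale
Sigma_hat <- cov(Z) * (n - 1) / n            # sample covariance with divisor n
eg    <- eigen(Sigma_hat, symmetric = TRUE)
P_hat <- eg$vectors                          # estimated PC weights
Z_pc  <- Z %*% P_hat                         # sample PCs; column j has variance close to lambda_j
round(eg$values, 3)                          # the PC ranking: lambda_1 >= ... >= lambda_m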
2.3.1 Problem With The PC Method of Instrument
Reduction
This section demonstrates the potential flaw with using PC methods as a basis for instrument reduction. In order to illustrate this problem the linear IV setup of Section 2.2.1 with one endogenous variable and no exogenous variables (i.e. p = 1) is used. The idea extends readily to more than one endogenous variable and to specifications with exogenous controls.[5]
xi = π'Zi + ηi    (2.10)

where π is an m × 1 vector of first stage coefficients. Take the population PCs Z^pc_i = P'Zi (where Z^pc_ij = Pj'Zi as defined above). Defining πpc := P'π, where πpc_j is the j'th element of πpc, then

xi = π'Zi + ηi = π'PP'Zi + ηi = πpc'Z^pc_i + ηi    (2.11)

since P'P = I and hence P' = P^{-1}. So πpc_j is the population coefficient from a regression of xi on the j'th PC. We now derive the contribution of each PC to the total variation in xi explained by all the PCs (i.e. decompose the total variation in xi explained by the whole instrument set across the PCs):
Var(π'Zi) = π'Σπ = π'PΛP'π = πpc'Λπpc = Σ_{j=1}^m π²pc_j λj    (2.12)
The PC method then (asymptotically) ranks the transformed instruments Z^pc_ij by λj. However the variation that the j'th PC Z^pc_ij contributes to the total variation of xi explained by all m instruments (Var(π'Zi)) is π²pc_j λj. If πpc_j = 0 then the j'th PC is irrelevant, irrespective of the size of λj.
In fact basing dimension reduction on a ranking by the eigenvalues could give a reverse ranking. Take for example the simple case where πpc_j = λj^{-δ} for all j = {1, .., m} with δ > 1/2. A simple example where the PC ranking provides a correct ranking of the strength of the PCs is where πpc_j = πpc_i for all i, j ∈ {1, .., m}. In fact it could be the case that all the variation in xi explained by the PCs lies within those components with the smallest eigenvalues: take the extreme example where πpc_j = 0 for all j ∈ {1, .., (m − 1)} and πpc_m ≠ 0. Yet in general the PCs with the smallest eigenvalues are exactly those dropped by the PC method of dimension reduction.
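The extreme example in the preceding paragraph can be written down directly; the population numbers in the R sketch below are invented solely to illustrate the point.

# Population illustration of the PC flaw: the smallest-variance PC carries all the signal
lambda <- c(5, 3, 1, 0.1)                 # eigenvalues of Sigma (PC variances)
pi_pc  <- c(0, 0, 0, 2)                   # first stage coefficients on the PCs
contrib <- pi_pc^2 * lambda               # each PC's contribution to Var(pi'Z), as in (2.12)
rbind(eigenvalue = lambda, contribution = contrib)
# Ranking by eigenvalue keeps PCs 1-3 and drops PC 4, yet PCs 1-3 contribute nothing
# to the variation in x while PC 4 contributes all of it.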
[5] εm(Zi) from (2.3) is omitted since under assumption it is asymptotically negligible as m → ∞.
2.4 Normalized Principal Components
A more intuitive way to transform the instrument set with which to form an ordering of the new instruments is to normalize all the PCs to have equal variance. Define Z^npc_i = Λ^{-1/2}Z^pc_i. Then

Var(Z^npc_i) = Λ^{-1/2}Var(Z^pc_i)Λ^{-1/2} = Λ^{-1/2}ΛΛ^{-1/2} = I_{m×m}
The Z^npc_i are the Normalized Principal Components (NPCs). Define πnpc := Λ^{1/2}P'π, so that

xi = π'Zi + ηi = πnpc'Z^npc_i + ηi    (2.13)

since πnpc'Z^npc_i = π'PΛ^{1/2}Λ^{-1/2}P'Zi = π'Zi.
Letting πnpc_j denote the j'th element of πnpc, the variation in xi explained by the m NPCs can be expressed as

Var(π'Zi) = Var(πnpc'Z^npc_i) = πnpc'Var(Z^npc_i)πnpc = πnpc'πnpc = Σ_{j=1}^m π²npc_j    (2.14)
Since π²npc_j is the contribution of the j'th NPC to the total variation in xi explained by all the NPCs, the NPCs may be ranked in terms of their relevance by the absolute size of their parameters in the first stage regression. A natural measure of the strength of NPC j is

Sj := π²npc_j / Σ_{l=1}^m π²npc_l    (2.15)
Sj is the proportion of the total explained variation attributable to NPC j, and a natural way of ordering the NPCs is by Sj. Arrange the NPCs with Sj from largest to smallest, so that S1 ≥ S2 ≥ ... ≥ Sm. The amount of the total variation in xi explained by the whole set of NPCs (and hence by the original instrument set) that is captured by the first k NPCs is

C≤k := Σ_{l=1}^k π²npc_l / Σ_{l=1}^m π²npc_l = Σ_{j=1}^k Sj    (2.16)
C≤k is the proportion of the total variation of the endogenous variable explained by the whole instrument set that is captured by the first k NPCs. This is an extremely useful tool for visualizing how the informative content in a set of instruments is spread across the NPCs.
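A minimal R sketch of the sample NPC ranking just described is given below. It is a simplified stand-in for the fuller implementation collected in Appendix 2.9, and the toy data and the function name npc_rank are illustrative assumptions.

# Minimal sketch of the sample NPC ranking (S_j and C_{<=k} of (2.15)-(2.16))
npc_rank <- function(x, Z) {
  n    <- nrow(Z)
  Sig  <- cov(Z) * (n - 1) / n
  eg   <- eigen(Sig, symmetric = TRUE)
  Znpc <- scale(Z, scale = FALSE) %*% eg$vectors %*% diag(1 / sqrt(eg$values))  # NPCs
  pihat <- coef(lm(x ~ Znpc))[-1]              # first stage coefficients on the NPCs
  S    <- pihat^2 / sum(pihat^2)               # strength shares S_j
  ord  <- order(S, decreasing = TRUE)
  list(order = ord, S = S[ord], C = cumsum(S[ord]))   # C_{<=k} is the cumulative share
}
set.seed(5)
Z <- matrix(rnorm(500 * 6), 500, 6)
x <- drop(Z %*% c(1, 0, 0.5, 0, 0, 0.2)) + rnorm(500)
npc_rank(x, Z)$C                               # cumulative share explained by the top NPCs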
48
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments
Estimation & Inference with Singular Moments

More Related Content

What's hot

Multiple linear regression II
Multiple linear regression IIMultiple linear regression II
Multiple linear regression IIJames Neill
 
Practice test ch 10 correlation reg ch 11 gof ch12 anova
Practice test ch 10 correlation reg ch 11 gof ch12 anovaPractice test ch 10 correlation reg ch 11 gof ch12 anova
Practice test ch 10 correlation reg ch 11 gof ch12 anovaLong Beach City College
 
Student's T-test, Paired T-Test, ANOVA & Proportionate Test
Student's T-test, Paired T-Test, ANOVA & Proportionate TestStudent's T-test, Paired T-Test, ANOVA & Proportionate Test
Student's T-test, Paired T-Test, ANOVA & Proportionate TestAzmi Mohd Tamil
 
Estimation theory 1
Estimation theory 1Estimation theory 1
Estimation theory 1Gopi Saiteja
 
Analysis of variance (ANOVA)
Analysis of variance (ANOVA)Analysis of variance (ANOVA)
Analysis of variance (ANOVA)Sneh Kumari
 
Characterization of student’s t distribution with some application to finance
Characterization of student’s t  distribution with some application to financeCharacterization of student’s t  distribution with some application to finance
Characterization of student’s t distribution with some application to financeAlexander Decker
 
The chi square test of indep of categorical variables
The chi square test of indep of categorical variablesThe chi square test of indep of categorical variables
The chi square test of indep of categorical variablesRegent University
 
Probability/Statistics Lecture Notes 4: Hypothesis Testing
Probability/Statistics Lecture Notes 4: Hypothesis TestingProbability/Statistics Lecture Notes 4: Hypothesis Testing
Probability/Statistics Lecture Notes 4: Hypothesis Testingjemille6
 
Lesson 8 Linear Correlation And Regression
Lesson 8 Linear Correlation And RegressionLesson 8 Linear Correlation And Regression
Lesson 8 Linear Correlation And RegressionSumit Prajapati
 
Day 12 t test for dependent samples and single samples pdf
Day 12 t test for dependent samples and single samples pdfDay 12 t test for dependent samples and single samples pdf
Day 12 t test for dependent samples and single samples pdfElih Sutisna Yanto
 
Assignment on Statistics
Assignment on StatisticsAssignment on Statistics
Assignment on StatisticsTousifZaman5
 

What's hot (20)

Chapter12
Chapter12Chapter12
Chapter12
 
Chi Squared Test
Chi Squared TestChi Squared Test
Chi Squared Test
 
Chi square test
Chi square testChi square test
Chi square test
 
Multiple linear regression II
Multiple linear regression IIMultiple linear regression II
Multiple linear regression II
 
Chapter08
Chapter08Chapter08
Chapter08
 
Practice test ch 10 correlation reg ch 11 gof ch12 anova
Practice test ch 10 correlation reg ch 11 gof ch12 anovaPractice test ch 10 correlation reg ch 11 gof ch12 anova
Practice test ch 10 correlation reg ch 11 gof ch12 anova
 
Student's T-test, Paired T-Test, ANOVA & Proportionate Test
Student's T-test, Paired T-Test, ANOVA & Proportionate TestStudent's T-test, Paired T-Test, ANOVA & Proportionate Test
Student's T-test, Paired T-Test, ANOVA & Proportionate Test
 
Estimation theory 1
Estimation theory 1Estimation theory 1
Estimation theory 1
 
Analysis of variance (ANOVA)
Analysis of variance (ANOVA)Analysis of variance (ANOVA)
Analysis of variance (ANOVA)
 
Z And T Tests
Z And T TestsZ And T Tests
Z And T Tests
 
Characterization of student’s t distribution with some application to finance
Characterization of student’s t  distribution with some application to financeCharacterization of student’s t  distribution with some application to finance
Characterization of student’s t distribution with some application to finance
 
Causality detection
Causality detectionCausality detection
Causality detection
 
The chi square test of indep of categorical variables
The chi square test of indep of categorical variablesThe chi square test of indep of categorical variables
The chi square test of indep of categorical variables
 
Test for independence
Test for independence Test for independence
Test for independence
 
Probability/Statistics Lecture Notes 4: Hypothesis Testing
Probability/Statistics Lecture Notes 4: Hypothesis TestingProbability/Statistics Lecture Notes 4: Hypothesis Testing
Probability/Statistics Lecture Notes 4: Hypothesis Testing
 
Lesson 8 Linear Correlation And Regression
Lesson 8 Linear Correlation And RegressionLesson 8 Linear Correlation And Regression
Lesson 8 Linear Correlation And Regression
 
Sampling Distributions and Estimators
Sampling Distributions and EstimatorsSampling Distributions and Estimators
Sampling Distributions and Estimators
 
Correlation
CorrelationCorrelation
Correlation
 
Day 12 t test for dependent samples and single samples pdf
Day 12 t test for dependent samples and single samples pdfDay 12 t test for dependent samples and single samples pdf
Day 12 t test for dependent samples and single samples pdf
 
Assignment on Statistics
Assignment on StatisticsAssignment on Statistics
Assignment on Statistics
 

Similar to Estimation & Inference with Singular Moments

Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learningSadia Zafar
 
A Logical Language with a Prototypical Semantics
A Logical Language with a Prototypical SemanticsA Logical Language with a Prototypical Semantics
A Logical Language with a Prototypical SemanticsL. Thorne McCarty
 
A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.
A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.
A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.inventionjournals
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selectionchenhm
 
Machine Learning and Artificial Neural Networks.ppt
Machine Learning and Artificial Neural Networks.pptMachine Learning and Artificial Neural Networks.ppt
Machine Learning and Artificial Neural Networks.pptAnshika865276
 
2.7 other classifiers
2.7 other classifiers2.7 other classifiers
2.7 other classifiersKrish_ver2
 
Suggest one psychological research question that could be answered.docx
Suggest one psychological research question that could be answered.docxSuggest one psychological research question that could be answered.docx
Suggest one psychological research question that could be answered.docxpicklesvalery
 
Optimistic decision making using an
Optimistic decision making using anOptimistic decision making using an
Optimistic decision making using anijaia
 
Nber Lecture Final
Nber Lecture FinalNber Lecture Final
Nber Lecture FinalNBER
 
Normality_assumption_for_the_log_re.pdf
Normality_assumption_for_the_log_re.pdfNormality_assumption_for_the_log_re.pdf
Normality_assumption_for_the_log_re.pdfVasudha Singh
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesAnne-Marie Tousch
 

Similar to Estimation & Inference with Singular Moments (20)

Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
 
A Logical Language with a Prototypical Semantics
A Logical Language with a Prototypical SemanticsA Logical Language with a Prototypical Semantics
A Logical Language with a Prototypical Semantics
 
Paper473
Paper473Paper473
Paper473
 
A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.
A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.
A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.
 
Unit3
Unit3Unit3
Unit3
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
 
Petrini - MSc Thesis
Petrini - MSc ThesisPetrini - MSc Thesis
Petrini - MSc Thesis
 
Machine Learning and Artificial Neural Networks.ppt
Machine Learning and Artificial Neural Networks.pptMachine Learning and Artificial Neural Networks.ppt
Machine Learning and Artificial Neural Networks.ppt
 
Paris Lecture 1
Paris Lecture 1Paris Lecture 1
Paris Lecture 1
 
2.7 other classifiers
2.7 other classifiers2.7 other classifiers
2.7 other classifiers
 
Suggest one psychological research question that could be answered.docx
Suggest one psychological research question that could be answered.docxSuggest one psychological research question that could be answered.docx
Suggest one psychological research question that could be answered.docx
 
Mathematical modeling
Mathematical modelingMathematical modeling
Mathematical modeling
 
Optimistic decision making using an
Optimistic decision making using anOptimistic decision making using an
Optimistic decision making using an
 
Nber Lecture Final
Nber Lecture FinalNber Lecture Final
Nber Lecture Final
 
ecir2019tutorial
ecir2019tutorialecir2019tutorial
ecir2019tutorial
 
Normality_assumption_for_the_log_re.pdf
Normality_assumption_for_the_log_re.pdfNormality_assumption_for_the_log_re.pdf
Normality_assumption_for_the_log_re.pdf
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the Trenches
 
1607.01152.pdf
1607.01152.pdf1607.01152.pdf
1607.01152.pdf
 
panel regression.pptx
panel regression.pptxpanel regression.pptx
panel regression.pptx
 
Dissertation Paper
Dissertation PaperDissertation Paper
Dissertation Paper
 

Estimation & Inference with Singular Moments

  • 1. Estimation & Inference Under Non-Standard Conditions by Nicky Grant Robinson College Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy Supervisor: Professor Richard J. Smith Faculty of Economics c August 2013
  • 2. ii
  • 3. Dedicated entirely to my Nanna Lilly, Without whom little of this would have been possible.
  • 4. ii
  • 5. Declaration I hereby declare that this dissertation is the result of my own work, includes nothing which is the outcome of work done in collaboration except where specifically stated in the text and is not substantially the same as any other work that I have submitted or will be sub- mitting for a degree or diploma or other qualification at this or any other university, and does not exceed the prescribed word limit of 60,000 words. Chapter 2 is a slightly condensed version of the paper published as Grant, Nicky. ‘Overcoming the Many Weak Instrument Problem Using Normalized Principal Components.’ Advances in Econometrics 29 (2012): 107-147. Chapter 3 is a based on a paper co-authored with Richard J. Smith based on an earlier working paper named ‘Estimation & Inference from Uncon- ditional Moment Inequality Restrictions Models Estimated via GMM and Generalized Empirical Likelihood’. Nicky Grant 6th August 2013 iii
  • 6. iv
  • 7. Summary This dissertation studies identification of some unknown parameter from a set of moment conditions, covering both inequality and equality restrictions. Chapter 1 considers identification robust inference from the inversion of the Generalised Anderson-Rubin Statistic (GAR) based on a χ2 m approximation where m is the number of moment conditions. This method is known to pro- vide valid inference under a set of assumptions including the moment variance be non-singular at the true parameter θ0, e.g Stock & Wright (2000). This assumption is shown to be untenable for many forms of identification failure in non-linear models, as noted for a class of regression models in Andrews & Cheng (2012). They overcome the issue of singularity for asymptotic analy- sis by a restrictive assumption that the moment variance be non-singular up to a particular matrix of model parameters. To provide results for general forms of identification failure a novel asymptotic approach is developed based on higher order asymptotic expansions of the eigensystem of the moment variance around θ0. Without reference to an assumption moment variance singularity takes a known form the GAR statistic is shown to possess a χ2 m limit under additional regularity conditions when moments are singular that are currently known in the literature. One such condition requires the null space of the moment variance lie within that of the outer product of the ex- pected first order derivative at θ0. When this condition is violated the GAR statistic is shown to be Op(n) and is termed the ‘moment-singularity bias’. A simulation experiment demonstrates this bias for a IV Linear Simultaneous Equations example. When this condition is almost violated the simulation shows the GAR statistic may be very oversized even for large sample sizes. v
  • 8. Summary Chapter 2 provides a method of ordering and selecting instruments so as to minimise the many weak instrument bias in linear IV settings. A potential flaw of the commonly used Principal Component (PC) method of instrument reduction is demonstrated. In light of this a new method is derived termed ‘Normalised Principal Components’ (NPC). This method provides a set of in- struments with a corresponding asymptotically valid ranking in terms of their correlation with the endogenous variable. This instrument set and ordering is then used to select instruments by minimising the MSE approximations of Donald & Newey (2001). Favourable small sample properties of the IV estimator based on this technique relative to PC methods are demonstrated in a simulation. Finally the NPC method is applied to the Vietnam War Draft IV setup of Angrist & Krueger (1992). Fourteen NPC’s are shown to have a non-zero correlation with education (p < 0.1) and 2SLS(and related) estimators based on such instruments estimate the returns to schooling to be much lower than that of both OLS and 2SLS with all instruments. Chapter 3 studies inference from unconditional moment inequalities, an area of research which is growing in popularity in the econometrics literature. Specifically the properties of a GEL-based estimator of the identified set from a set of moment inequalities are derived. To do so the results presented in Chernozhukov Hong & Tamer (2007) [CHT] based on a GMM type estimator from a set of moment inequalities are extended by dropping the assumption that the weight matrix is (asymptotically) diagonal. This assumption though seemingly innocuous is critical to the results and proofs of this paper. The GEL objective function is then shown on the identified set to be first order asymptotically equivalent to that of GMM with weight matrix equal to the inverse of the sample moment variance. Using this result consistency of the GEL estimator for the identified set and rate of convergence in the Hausdorff Distance along with the requisite regularity conditions are established. vi
  • 9. Acknowledgments Would like to thank my supervisor Richard J. Smith for his help and guidance over the past 4 years. Also my research advisor Hashem Pesaran who has provided useful advice and tips for reading which has enriched the content of this dissertation. I would also like to acknowledge the helpful discussions in the graduate office over the years- especially with Manasa Patnam and Steve Thiele. Also conference participants at the World Congress of the Econometrics Society meetings in Shanghai (2010) and the 11th Advances in Econometrics Conference ‘Essays in Honor of Jerry Hausman’ for useful comments related to some of the material in this dissertation.. I would also like to thank all the participants at my Job Market Presenta- tions where I received a lot of positive and helpful feedback. Namely seminar participants at University of Manchester, University of Pompeu Fabra, Uni- versity of St Gallen, Bilkent University, New Economic School, University of New South Wales and the University of Bristol. Also comments from my practise job market talk at the University of Cambridge and at the Euro- pean Winter Meetings 2012 which greatly enriched the quality of my final job market presentations and paper. In particular Melvyn Weeks, Alastair Hall, Barbara Rossi, Oliver Linton, Majid Al-Sadoon, SeoJeong Lee and Daniel Buncic. Finally I would like to acknowledge the ERSC funding which financed my PhD studies from 2009-2012. vii
  • 10. viii
  • 11. Definitions Statistical Definitions: E[x]- Mathematical expectation of x with respect to the density of x. E[x|y] - Mathematical expectation of x with respect to the distribution of x conditional on y. p → - Convergence in probability d → - Convergence in distribution d →- Weak convergence → - For any deterministic sequence an then an → b denotes b as the deterministic limit of an. d ∼ - Shorthand for ‘ is distributed as’ w.p.a.1- With probability approaching 1 w.p.1 - With probability 1 op(a) - A variable that converges to zero w.p.a.1 when divided by a Op(a) -A variable bounded in probability when divided by a a.s(z) - Refers to ‘almost surely’ with respect to the distribution z Matrix Definitions: Let A,B refer to arbitrary matrices, C an arbitrary vector and a,b two arbi- trary real numbers. A = 0 - All entries of A equal to 0 Rank(A) - Rank of A A−1 -Inverse of A A− - Moore-Penrose Generalised Inverse (AA− A = A) Null(A)- Null Space of A ix
tr(A) - Trace of A
||A|| - Euclidean norm, (tr(A'A))^{1/2}
B(C, ε) - A ball of radius ε around C, where ε > 0
⊗ - Kronecker product
A \ B = A − B - Set subtraction
diag(A) - Diagonal matrix formed from the diagonal entries of A
[A]ij - ij-th element of A
Ia×a - a × a identity matrix
0a×a - An a × a matrix of zeroes
0a - a × 1 vector of zeroes
a ∧ b = max{a, b}
a− = min{a, 0}
a+ = max{a, 0}
||a||− = ||a−||

Abbreviations
p.s.d - Positive semi-definite
p.d - Positive definite
f.c.r - Full column rank
w.r.t - With respect to
s.t - Such that
iff - If and only if
CMT - Continuous Mapping Theorem
T - Triangle inequality
CS - Cauchy-Schwarz inequality
M - Markov inequality
UWL - A uniform weak law of large numbers, such as Lemma 2.4 of Newey and McFadden (1994)
CLT - Lindeberg-Lévy Central Limit Theorem
KWLLN - Khintchine Weak Law of Large Numbers
MSE - Mean Squared Error
IV - Instrumental Variables
GMM - Generalised Method of Moments
(G)EL - (Generalised) Empirical Likelihood
2SLS - Two Stage Least Squares
OLS - Ordinary Least Squares
PC - Principal Components
Contents

Declaration iii
Summary v
Acknowledgments vii
Definitions ix

1 Identification Robust Inference with Singular Variance 1
1.1 Introduction 2
1.2 Identification and Singular Variance 5
1.2.1 Conditional Moments 6
Case (i): E[∂ρ(θ)/∂θ'|z] = ∂ρ(θ)/∂θ' 6
Case (ii): E[∂ρ(θ)/∂θ'|z] ≠ ∂ρ(θ)/∂θ' 7
1.2.2 Examples of Singular Variance 9
Singular Variance: Null(Ω) ⊆ Null(GG') 9
Singular Variance: Null(Ω) ⊈ Null(GG') 11
1.3 Matrix Perturbation Theory 12
1.3.1 Asymptotic Eigensystem Expansions 14
1.4 Generalized Anderson Rubin Statistic with Singular Variance 16
1.4.1 Simulation: Heckman Selection 19
1.4.2 Moment-Singularity Bias when Null(Ω) ⊈ Null(GG') 21
Simulation: Linear IV Simultaneous Equations 21
1.5 Conclusion 25
1.6 Appendix 27

2 Overcoming The Many Weak Instrument Problem Using Normalized Principal Components 37
2.1 Introduction 38
2.2 Instrument Selection Methods 41
2.2.1 MSE Approximations of Donald & Newey (2001) 42
2.2.2 The Regularization MSE Approach 43
2.3 Principal Components Ranking of Instruments 45
2.3.1 Problem With The PC Method of Instrument Reduction 46
2.4 Normalized Principal Components 48
2.4.1 Sample NPC Method 49
2.5 Simulation 52
2.5.1 Simulation Results 54
2.6 Application to Angrist & Krueger (1992) 56
2.7 Conclusion 66
2.8 Appendix 68
2.9 Appendix A: Implementing NPC Method 69
2.9.1 R Code for NPC Instrument Selection 71

3 GEL-Based Inference with Unconditional Moment Inequality Restrictions 75
3.1 Introduction 75
3.2 Moment Inequality Restrictions 77
3.3 GMM and GEL 78
3.3.1 GMM 78
3.3.2 GEL 79
3.3.3 Identified Set 82
3.4 Set Estimation 83
3.4.1 GMM 83
3.4.2 GEL 84
3.5 Conclusion 85
Appendix 85
Appendix A: Assumptions 86
Appendix B: Preliminary Lemmas 87
Appendix C: Proofs for GMM 90
Appendix D: Proofs for GEL 93
D.1 GEL Estimator Equivalence 93
D.2 Asymptotics for GEL 95
Appendix E: Identified Set 100

Bibliography 105
List of Figures

2.1 NPC Decomposition of Proportion of Variation in Education (130 Instruments) 60
2.2 MSE Approximation Ordered by NPCs (130 Instruments) 61
2.3 NPC Decomposition of Proportion of Variation in Education (360 Instruments) 64
2.4 MSE Approximation Ordered by NPCs (360 Instruments) 65
List of Tables

1.1 GAR Rejection Probabilities: Heckman Selection 20
1.2 rGAR Rejection Probabilities: Heckman Selection 20
1.3 GAR Rejection Probabilities, ¯π = 0 (Moment-Singularity Bias) 23
1.4 GAR Rejection Probabilities, ¯π = 0.1 (Moment-Singularity Bias) 24
1.5 GAR Rejection Probabilities, ¯π = 0.5 (Moment-Singularity Bias) 25
2.1 First Stage PC Coefficients (π²_pcj) 53
2.2 Eigenvalues of Principal Components (λj) 53
2.3 Variation in xi explained by Principal Components (π²_pcj λj) 54
2.4 Simulation Results: NPC vs. PC Instrument Selection 56
2.5 NPC First Stage Regression Coefficients (130 instruments) 59
2.6 NPC First Stage Regression Coefficients (360 instruments) 62
2.7 Estimates of Returns to Education 63
Chapter 1

Identification Robust Inference with Singular Variance

This chapter studies identification robust inference when moments have singular variance at the true parameter θ0. Existing robust methods assume that the moment variance is non-singular at θ0 up to a particular known matrix of parameters, Andrews and Cheng (2012). This is shown to restrict the class of identification failure for which current results on robust methods hold. General conditions under which the GAR statistic has a χ²_m limit distribution are derived utilizing second-order asymptotic eigensystem expansions of the sample variance matrix around θ0. This method avoids restrictive assumptions on the rank and form of the population variance along sequences converging to the true parameter. A crucial condition for this result requires that the null space of the moment variance lies within that of the outer product of the expected first-order derivative at θ0. When this condition is violated the GAR statistic is Op(n), which is termed the 'moment-singularity bias'. Empirically relevant examples of this problem are provided and the bias is verified in a simulation.

Keywords: Generalized Anderson Rubin Statistic, Identification Failure, Singular Variance, Non-linear Models, Matrix Perturbation Theory.
1.1 Introduction

Identification robust methods of inference have gained increasing prominence in the econometrics literature in the last decade. Broadly, the objective has been to provide asymptotically valid methods of inference on some unknown parameter θ0 that are robust to failures of either global or first-order identification. A substantive part of this literature derives confidence sets containing θ0 with asymptotically correct probability by inverting a pre-specified test statistic over a parameter space.

A large part of this literature has focussed on Linear Instrumental Variable (IV) settings, with its roots in the work of Anderson and Rubin (1949). A now sizeable literature has developed providing alternative procedures aiming to make as few assumptions as possible to justify asymptotically valid inference on θ0, including but not limited to Kleibergen (2002, 2005), Moreira (2003), Chernozhukov & Hansen (2008), Kleibergen & Mavroeidis (2009), Magnusson (2010) and Guggenberger et al. (2012). General non-linear moment functions have received relatively little attention in this literature, a notable exception1 being the GAR statistic of Newey and Windmeijer (2009), also known as the Continuous Updating Estimator (CUE) statistic, Guggenberger, Ramalho and Smith (2005); confidence regions based on the GAR statistic are the 'S-sets' of Stock and Wright (2000).

Let wi (i = 1, .., n) be an independent and identically distributed (i.i.d) data set with a known m × 1 moment function g(w, θ) satisfying the moment condition E[g(wi, θ)] = 0 at the true parameter θ0 ∈ Θ ⊆ Rp. Define the sample moment function and corresponding variance matrix respectively as

\hat{g}(\theta) := \frac{1}{n}\sum_{i=1}^{n} g(w_i,\theta), \qquad \hat{\Omega}(\theta) := \frac{1}{n}\sum_{i=1}^{n} g(w_i,\theta)g(w_i,\theta)'.

The GAR statistic is defined as

\hat{T}_{GAR}(\theta) := n\,\hat{g}(\theta)'\,\hat{\Omega}(\theta)^{-1}\hat{g}(\theta).

1 The K-statistic of Kleibergen (2005) also permits general non-linear moment functions; however, the proof of asymptotic validity does not adequately account for singular variance in the transformed moment function considered. This issue is beyond the scope of this chapter, though the author intends to work on it in future research.
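As a computational illustration of this definition, a minimal R sketch of the GAR statistic for a candidate θ is given below. The function name gar_stat, the simple linear IV moment used in the usage lines and the sample size are illustrative choices only, and are not part of the formal development.

# Minimal sketch: the GAR statistic at a candidate theta.
# 'gmat' is the n x m matrix whose i-th row is g(w_i, theta).
gar_stat <- function(gmat) {
  n    <- nrow(gmat)
  gbar <- colMeans(gmat)               # sample moment \hat{g}(theta)
  Om   <- crossprod(gmat) / n          # \hat{Omega}(theta) = (1/n) sum_i g_i g_i'
  as.numeric(n * t(gbar) %*% solve(Om) %*% gbar)
}

# Illustrative use with a simple linear IV moment g(w, theta) = (1, z)'(y - theta*x):
set.seed(1)
n <- 200; z <- rnorm(n); x <- 0.5 * z + rnorm(n); y <- x + rnorm(n)
gmat <- cbind(1, z) * (y - 1 * x)      # moments evaluated at theta = 1
gar_stat(gmat)                         # values below qchisq(0.9, 2) lie in the 90% S-set

Inverting gar_stat over a grid of θ values and retaining those below the χ²_m critical value gives the identification robust confidence set discussed above.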
Under a set of assumptions, including that the asymptotic moment variance Ω := E[g(wi, θ0)g(wi, θ0)'] is non-singular, ˆTGAR(θ0) converges in distribution to χ²_m (e.g. Stock and Wright (2000)). The majority of the literature on identification robust inference makes no explicit assumption of first-order identification, namely that G := E[Gi(θ0)] is full column rank, where Gi(θ) := ∂g(wi, θ)/∂θ'.

The impetus for this chapter stems from the fact that Ω must be singular when G is not full rank for a class of non-linear moment functions, including single-equation Non-Linear Least Squares and Maximum Likelihood. This result has largely gone unmentioned in the identification robust literature. In light of this issue, current results in the identification robust literature justify valid inference only for a restricted class of identification failure, limited largely to linear models.

An exception is Cheng (2008) and Andrews and Cheng (2012), who note the link between identification failure and singular variance for a particular form of identification failure in semi-linear regression models. Cheng (2008) derives the limit distribution of the Non-Linear Least Squares (NLS) estimator for such models. Using this result the distributions of the t, Wald and Quasi-Likelihood Ratio (QLR) statistics are evaluated and methods of identification robust inference based on these statistics are proposed.

Both papers overcome the issue of singularity of Ω arising from identification failure by assuming that the form of the singularity is known up to a matrix of model parameters. The class of identification failure (and hence singular variance) satisfying this assumption is shown to be restrictive, being difficult to motivate outside of the particular examples of identification failure studied in these papers.

This chapter differs from Andrews and Cheng (2012) in two ways. (i) Conditions under which the GAR statistic is asymptotically χ²_m are provided for general forms of identification failure, requiring no assumptions on the form of moment singularity. (ii) To achieve (i) the GAR statistic is expanded around θ0 via second-order asymptotic expansions of the eigensystem of ˆΩ(θ)−1. This method is of interest in its own right and would prove useful in extending results for other identification robust statistics and estimators to allow for general
  • 26. Chapter 1 forms of identification failure. Second order asymptotic expansions of the eigenvectors of ˆΩ(θ) around θ0 are derived borrowing results from Matrix Perturbation Theory with its roots in Kato (1982). This field has not readily made it in to the mainstream econo- metric literature- exceptions being Ratsimalahelo (2002) who consider tests of matrix rank, Moon and Weidner (2010) derive expansions of the Quasi Maximum Likelihood profile function for panel data models and Hassani et al (2011) use such expansions for Singular Spectral Analysis. Utilizing this result general second order eigenvalue expansions of ˆΩ(θ) around θ0 are established. Specific expansions under an i.i.d assumption (along with requisite regularity conditions) are then derived. These eigensystem expan- sions will prove useful when extending the results of this chapter to non-i.i.d settings and are new in the identification literature. In order for the result (i) to hold further conditions on Gi(θ) and ˆΩ(θ) at θ0 are required when considering general forms of identification failure. A key condition requires those δ ∈ Rm such that δ Ω = 0 imply δ G = 0 (i.e the null space of Ω is a subset of that of GG ). For example this rules out singu- lar variance when the strong identification conditions hold in just-identified models. In this case the GAR statistic is shown to be bounded in probabil- ity of order n. This issue currently unknown in the literature is termed the ‘moment-singularity bias’. Simulation evidence demonstrates this bias in a Linear IV Simultaneous Equation setup. The small sample approximation of the GAR statistic by a χ2 m distribution is shown to be poor when the null space of Ω almost does not lie within that of GG (i.e when δ Ω ≈ 0 and δ G = 0). In this case the GAR statistic is shown to be oversized even for large sample sizes. Numerous examples of singular variance for commonly used moment func- tions are provided including financial econometric models and Non-Linear IV Simultaneous Equations. Many cases where the assumption on the form of the singularity in Andrews and Cheng (2012) is violated are provided. Section 1.2 explores the relationship between G and Ω for conditional moment restrictions. Section 1.3 sets out the asymptotic approach, deriving second or- der asymptotic expansions of the eigensystem of ˆΩ(θ) and specific expansions 4
  • 27. Identification Robust Inference with Singular Variance in the case wi is i.i.d. Section 1.4 provides conditions under which the GAR statistic is asymptotically locally χ2 m and explains the ‘moment-singularity bias’. An extensive simulation study is also provided demonstrating the main results of this chapter. Section 1.5 presents conclusions and directions for further research. An Appendix collects proofs of the main theorems. 1.2 Identification and Singular Variance The link between identification failure and singular variance is not a new idea in the identification robust literature. Andrews and Cheng (2012) provide asymptotic results under the assumption that there exists B(θ), B(θ) = diag(Im∗×m∗ , ι(θ)I¯m× ¯m) (1.1) Where m∗ = Rank(Ω) and ¯m = m − m∗ , ι(θ) = ||θ|| such that, B(θn)−1 ˆΩ(θn)B(θn)−1 p → ¯Ω (1.2) For all θn = θ0+∆n where ||∆n|| > 0 and ||∆n|| = op(n−1/2 ) and Rank(¯Ω) = m. They derive asymptotic properties of (functions) of general extremum estima- tors working with the transformed moment function B(θn)−1 √ nˆg(θn) where asymptotic singularity of √ nˆg(θn) is eradicated. Once the moment function is transformed the limit variance is non-singular and standard asymptotic analysis is feasible. The existence of such a matrix B(θ) satisfying (1.2) is restrictive, being difficult to motivate generally outside of piecewise linear models with particular forms of identification failure, see Section 1.2.2. Section 1.2.1 studies the relationship between G and Ω more generally from moment conditions derived from a system of conditional moment restrictions. Conditions under which Null(Ω) ⊆ Null(GG ) are derived for general non- linear models with arbitrary forms of identification failure. As demonstrated in Section 1.4.2 this condition turns out to be crucial for ˆTGAR(θn) to be bounded in probability with a χ2 m limit distribution. Empirically relevant examples are given where this condition does not hold in Section 1.2.2. 5
1.2.1 Conditional Moments

Consider a J × 1 residual function ρ(θ) := ρ(x, θ), where ρ(·, ·) : X × Θ −→ RJ, x ∈ X ⊆ Rl, with an h × 1 instrument z satisfying

E[ρ(θ)|z] = 0 at θ = θ0. (1.3)

Broadly speaking there are two types of moment function derived from (1.3), depending upon whether E[∂ρ(θ)/∂θ'|z] must be estimated beforehand. In case (i), E[∂ρ(θ)/∂θ'|z] = ∂ρ(θ)/∂θ' a.s.(z), for example Non-Linear Least Squares (NLS) and unconditional Maximum Likelihood (MLE), where x = z. For ML, ρ(θ) is the log-likelihood contribution; if the moment condition used to form the GAR statistic is the score condition E[∂ρ(θ)/∂θ'] = 0 at θ = θ0, then in the correctly specified case the moment variance equals the Fisher Information Matrix. The issue would also arise in Quasi-Maximum Likelihood estimation if Var(∂ρ(θ)/∂θ') is singular at the pseudo-true parameter θ = θ∗. The derivation of the distribution of the (Q)MLE statistic evaluated at points near to θ∗ is beyond the scope of this chapter. Alternatively, in case (ii), E[∂ρ(θ)/∂θ'|z] ≠ ∂ρ(θ)/∂θ' for z with measure greater than zero, for example non-linear instrumental variables, where generally z ≠ x.

Case (i): E[∂ρ(θ)/∂θ'|z] = ∂ρ(θ)/∂θ'

Define D(θ, z) := E[∂ρ(θ)/∂θ'|z] and Ωρ(θ, z) := E[ρ(θ)ρ(θ)'|z]. In the i.i.d setting the optimal instrument is D(θ0, z)'Ωρ(θ0, z)−1, Newey (1993). Take the case J = 1, forming the moment g(θ) = D(θ, z)ρ(θ), so that

Ω = E[ρ(θ0)² D(θ0, z)D(θ0, z)'], G = E[D(θ0, z)D(θ0, z)'].

Hence for any δ ∈ Rp, δ'Ωδ = 0 implies E[ρ(θ0)²(δ'D(θ0, z))²] = 0 and thus
  • 29. Identification Robust Inference with Singular Variance δ D(θ0, z) = 0 a.s(z). Therefore δ Gδ = E[(δ D(θ0, z))2 ] = 0. The reverse is also simple to establish, so that Null(Ω) ≡ Null(G) ≡ Null(GG ). First order under-identification and singular variance are equivalent for single equation NLS. This result may break down for J ≥ 2 if Ωρ(θ0, z) is singular a.s(z) existing cases where the null space of GG and Ω are not equivalent2 . Proposition 1: For g(θ) = D(θ, z)ρ(θ) Null(Ω) ⊆ Null(GG ) iff δ ∈ Rp such that D(θ0, z) δ ∈ Null(Ωρ(θ0), z)/0 a.s(z) Proof For δ = 0, δ Ωδ = E[δ D(θ0, z)Ωρ(θ0, z)D(θ0, z) δ] = 0 iff ∃δ ∈ Rp such that D(θ0, z) δ lies in the null space of Ωρ(θ0, z) a.s(z) since δ G = 0 iff δ D(θ0, z) = 0 a.s(z). Q.E.D Case (ii) E[∂ρ(θ)/∂θ |z] = ∂ρ(θ)/∂θ Commonly when D(θ0, z) is not known a priori the fact that (1.3) implies the following moment condition for any m × 1 Z := (φ1(z), .., φm(z)) where {φj(.) : j = {1, .., m}} are arbitrary functions of z (e.g polynomials in z up to order m), E[ρ(θ) ⊗ Z] = 0 at θ = θ0 For example the Consumption Capital Asset Pricing Model moment condi- tions in Stock and Wright (2000). In this case G = E[D(θ0, z) ⊗ Z] Ω = E[Ωρ(θ0, z) ⊗ ZZ ] Where G is an mJ × p matrix and Ω is mJ × mJ. In this case in general the null space of Ω and GG are not necessarily linked. Given that Z includes 2 Note a similar result can also be shown based utilizing an estimate of the optimal instrument based on an a consistent estimator of a generalized inverse Ωρ(θ0, z)− noting that the Rank(Ωρ(θ0, z)) = Rank(Ωρ(θ0, z)− ). 7
  • 30. Chapter 1 no linearly redundant combinations of instruments then Ω may be less than full rank only when Ωρ(θ0, z) is not full rank a.s(z). Define δ := (δ1, .., δJ ) where δj ∈ Rm for j = {1, .., m}. Proposition 2: Null(Ω) ⊆ Null(GG ) iff δ ∈ RmJ where(δ1Z, .., δJ Z) ∈ Null(Ωρ(θ0, z)) a.s(z) such that δ ∝ ν for some ν ∈ RmJ s.t ν G = 0 Proof: For δ = 0 then δ Ωδ = E[(δ1Z, .., δJ Z)Ωρ(θ0, z)(δ1Z, .., δJ Z) ] hence δ Ω = 0 iff (δ1Z, .., δJ Z) ∈ Null(Ωρ(θ0, z)) a.s(z). The null space of Ω will not lie in that of GG iff ∃δ ∈ RmJ such that (δ1Z, .., δJ Z) ∈ Null(Ωρ(θ0, z)) where δ ∝ ν for some ν ∈ RmJ where ν G = 0. Q.E.D Remarks: (i) When Ωρ(θ0, z) is homoscedastic (i.e Ωρ(θ0, z) = Ωρ a.s(z) for some p.s.d symmetric m × m matrix Ωρ) then it is straightforward to show that Rank(Ω) = mr where r = Rank(Ωρ). (ii)If for any function a(.) of z ∃ π ∈ Rm such that E[(π Z − a(z))2 ] → 0 (1.4) For m → ∞ then Rank(Ω) ≤ mJ − r∗ (as m → ∞) where r∗ = J − Rank(Ωρ(θ0, z)) a.s(z). Since by (1.4) there will exist at least r∗ linearly independent vectors δ ∈ RmJ s.t (δ1Z, .., δJ Z) can be expressed as some linear combination of elements of the null space of Ω(θ0, z) a.s(z) for m large. Especially a concern is (ii) as even if ρ(θ0) has no perfectly correlated (linear combination of) elements (E[Ωρ(z, θ0)] is full rank), Ω will be singular for m large when there exists perfect conditional correlation in elements of ρ(θ0) (i.e r∗ > 0). This would violate the condition for GAR to be asymptotically χ2 m. An example of this case is provided Example 3 in Section 1.2.2 with a corresponding simulation provided in Section 1.4.2. 8
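Given sample analogues of Ω and G, the condition Null(Ω) ⊆ Null(GG') discussed in the remarks above can also be checked numerically. The R sketch below is one illustrative way to do so; the function name and the tolerance tol are arbitrary choices and not part of the theory.

# Sketch: check numerically whether the null space of Omega lies inside
# the null space of G G'.  'Omega' is m x m p.s.d., 'G' is m x p.
null_in_null <- function(Omega, G, tol = 1e-8) {
  eo   <- eigen(Omega, symmetric = TRUE)
  zero <- eo$values < tol * max(eo$values, tol)
  P0   <- eo$vectors[, zero, drop = FALSE]        # (near-)null directions of Omega
  if (ncol(P0) == 0) return(TRUE)                 # Omega non-singular: condition holds trivially
  # delta'Omega = 0 should imply delta'G = 0 for every column delta of P0
  max(abs(t(P0) %*% G)) < tol * max(abs(G), tol)
}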
  • 31. Identification Robust Inference with Singular Variance 1.2.2 Examples of Singular Variance This sections provides examples of moment functions with singular variance both with and without identification- specifically when the condition that Null(Ω) ⊆ Null(GG ) holds or does not. Singular Variance : Null(Ω) ⊆ Null(GG ) A class of identification failure satisfying Null(Ω) ⊆ Null(GG ) is the stochas- tic semi-linear parametric equations (for J = 1) considered in Cheng (2008)3 . y = α x + πf(z, γ) + Where θ = (α, γ, π), α ∈ Rq , π ∈ R, γ ∈ Rl and f(·, ·) : Rd × Rl → R is a continuously differentiable function. Let w = (y, x, z) where y is a scalar random variable, x is q ×1 and z is d×1 where E[ |x, z] = 0 at θ = θ0 for some parameter vector θ0 = (α0, γ0, π0). Define f(γ) := f(z, γ), (θ) := y − α x − πf(γ), ∂ (θ) ∂θ = (x, f(γ), π∂f(γ)/∂γ) Then the moment function utilized in NLS is g(θ) = (θ)(x, f(γ), π∂f(γ)/∂γ) Under the i.i.d assumption the variance of the moments at any θ ∈ Θ is Ω(θ) = E (θ)2    xx f(γ)x πx∂f(γ)/∂γ f(γ)x πf(γ)2 πf(γ)∂f(γ)/∂γ π∂f(γ)/∂γx πf(γ)∂f(γ)/∂γ π2 ∂f(γ)/∂γ∂f(γ)/∂γ    Ω would be singular in the following three cases (and potentially others), 3 Cheng (2008) allow for a vector of non-linear functions though for simplicity this special case is highlight to demonstrate the infeasibility of the assumption on the form of the singular variance made in both Cheng (2008) and Andrews and Cheng (2012). 9
  • 32. Chapter 1 (i) θ0 = (α, γ, 0) for any (α, γ) ∈ Rq+l . (ii) f(γ0) = δ x for some δ ∈ Rq . (iii) δ1∂f(γ0)/∂γ = δ2x for some δ1 ∈ Rl and δ2 ∈ Rq . where ||δ1|| > 0. Case (i) falls under the assumption of Andrews and Cheng (2012). Namely for the matrix B(θ) = I2×2 02 02 π then B(θ)−1 Ω(θ)B(θ)−1 is no longer a function of π. In this case singularity cased by π0 = 0 is removed. However there exist no matrix of the form B(θ) that will remove the singularity for cases (ii) and (iii) and more generally for arbitrary forms of singularity that depend upon the Data Generating Process. Example 1: Heckman Selection Consider a Heckman Selection Re- gression where f(z, γ) = φ(z γ)/Φ(−z γ) is the Inverse Mills Ratio and z corresponds to variables which govern sample selection. If z γ0 = c for some constant c and x includes a constant then singularity arises from (ii). Even if this condition does not hold, as noted by Puhani (2000) and others the Inverse Mills Ratio is approximately linear for a wide range of γ. In this case if x and z contain coinciding variables then NLS would be weakly identified with almost singular variance. Example 2 provides a case of a general non-linear moment function where Null(Ω) ⊆ Null(GG ). Also note that in this case there exists no matrix B(θ) satisfying (1.2). Example 2: Interest Rate Dynamics r − r−1 = a(b − r−1) + σrγ Where r−1 is the first lag of the interest rate r. Define θ = (a, b, σ, γ). Under the assumption that is stationary at θ = θ0 where θ0 = (a0, b0, σ0, γ0) then using the test-function approach of Hansen and Scheinkman (1995) the 10
  • 33. Identification Robust Inference with Singular Variance following moment function is derived in Jagannathan and Wang (2002), g(θ) =       a(b − r)r−2γ − γσ2 r−1 a(b − r)r−2γ+1 − (γ − 1 2 )σ2 (b − r)r−a − 1 2 σ2 r2γ−a−1 a(b − r)r−σ − 1 2 σ3 r2γ−σ−1       satisfying E[g(θ)] = 0 at θ = θ0. When σ0 = a0, γ0 = 1/2(a0 + 1) or γ0 = 1/2(σ0 + 1) redundant moments exist at the true parameter. For example if all three conditions held simulta- neously the rank of Ω be 1 as there would exist only one linearly independent comibination in g(θ). Singular Variance: Null(Ω) Null(GG ) Common causes of singular variance arise from a lack of identification. It is however plausible that singular variance occurs where Null(Ω) Null(GG ), for example in just-identified settings when G is full rank (first-order identi- fied) though Ω is singular. Example 3: IV Simultaneous Equations Consider an example of a conditional moment restriction where J = 2, ρ1(θ0) = h1(z) ρ2(θ0) = h2(z) Where E[ 2 |z] = 1 and h1(z) and h2(z) are the conditional heteroscedasticity for equations 1 and 2 respectively. Let Z be an m × 1 vector function of z used as instruments. Let δ = (δ1, δ2) where δ1, δ2 ∈ Rm then δ Ωδ = E[h1(z)(δ1Z)2 ] + E[h2(z)(δ2Z)2 ] + 2E[ h1(z)h2(z)δ1Zδ2Z] For example if δ1Z = 1/ h1(z), δ2Z = −1/ h2(z) then Ω is singular. In the case where h1(z) = h2(z) then any δ1, δ2 ∈ Rm where δ1Z = −δ2Z would 11
  • 34. Chapter 1 yield δ Ωδ = 0. This is an example of Proposition 2 and in general δ Ω = 0 does not imply δ G = 0. Take for example ρ1(θ) = y1 − θ1x1 ρ2(θ) = y2 − θ2x2 Where θ = (θ1, θ2), x = (y1, y2, x1, x2) with instrument vector Z = (1, z). Assuming E[x1|z] = ¯π(1 + z), E[x2|z] = −¯π(1 + z2 ) and z ∼ N(0, 1) it is straightforward to establish, G = ¯π(1, 1) 02 02 ¯π(−2, 0) If h1(z) = h2(z) then δ1 = (c, 0), δ2 = (−c, 0) for c = 0 imply δ Ω = 0 however δ G = (c¯π, 2c¯π) = 0 when ¯π = 0. Note that if instruments were irrelevant (¯π = 0) then δ G = 0 for all directions δ ∈ R4 . Though the example here is somewhat pathological (requiring ρ1(θ0), ρ2(θ0) be perfectly correlated) the problem extends also to the case where no equa- tions are perfectly correlated, i.e h1(z) = h2(z)). For example if h1(z) = exp(−ζ1z) and h2(z) = exp(−ζ2z) (where ζ1 = ζ2) if Z includes polynomial orders of z up to m then δ1 and δ2 such that δ1z = 1+1/2ζ1z +...+(1/2ζ1z)m /m! and δ2z = −(1+1/2ζ2z +...+(1/2ζ2z)m /m!) will well approximate 1/ h1(z) and −1/ h2(z) respectively for m large. When using many instruments (and/or with J large) it is entirely plausible there exist directions in which δ Ω = 0 that do not imply δ G=0. 1.3 Matrix Perturbation Theory Section 1.4 derives conditions under which ˆTGAR(θn) converges in distribution to a χ2 m limit for any local sequence θn = θ0 + ∆n−δ ( hence ∆n = ∆n−δ ) where ∆j = 0 ∀j = {1, .., p}. This is the asymptotic approach in Bottai (2003) and others. This could be generalised to allow ∆ to be a potentially random variable such that n−δ ∆n d → ∆ and allow for different rates of con- 12
  • 35. Identification Robust Inference with Singular Variance vergence. This would however not change or add to the fundamental result in Theorem 1. Crucially we model each parameter as perturbed away from its true value, where the perturbation may be made infinitesimally though never zero (i.e any finite δ > 0 and ∆j = 0 for all j = {1, .., p}). For ex- ample if one parameter in θ0 leads to singularity, then if parameter is not perturbed the matrix will be singular irrespective of perturbations to the remaining parameters. And again, if we can establish that the inverting the GAR statistic with a χ2 m covers an infinitesimally small region around all parameters this ensures it covers θ0. This method is used to establish local uniform coverage in other papers as mentioned for example by Bottai (2003). Even if θ0 were not a point of singularity we would still wish to establish that this local coverage condition and is not an assumption made to deal with po- tential singularity. Nor is it an assumption the true parameter is a drifting sequence- though it could be interpreted this way if desired without loss of generality. The large simple distribution of the GAR is derived without an assumption the form of the singularity is known. To do so the GAR statistic at θn is expanded around the point of singularity θ0, requiring second order expansions of the eigensystem of ˆΩ(θn) around θ0. This section is concerned with deriving these expansions. Firstly definitions for the eigensystem of the functional matrix Ω(θ) and ˆΩ(θ) are outlined. By construction both matrices are p.s.d and symmetric hence the following decompositions can be made for all θ ∈ Θ. Let the m×m matrix P(θ) be the matrix of population eigenvalues where Ω(θ) = P(θ)Λ(θ)P(θ) Such that P(θ) P(θ) = Im and Λ(θ) contains the eigenvalues of Ω across the diagonal and zeros on the off-diagonal. Define the rank of Ω(θ) as m − ¯m(θ) where 0 ≤ m(θ) ≤ m. Express P(θ) = (P+(θ), P0(θ)) and Λ(θ) = Λ+(θ) 0 0 Λ0(θ) where Λ+(θ) is an (m− ¯m(θ))×(m− ¯m(θ)) diag- onal matrix with the non-zero eigenvalues of Ω(θ) on the diagonal with cor- responding eigenvector matrix P+(θ). Λ0(θ) = 0¯m(θ)× ¯m(θ) with corresponding eigenvector matrix P0(θ). Performing an eigenvalue decomposition re-write 13
  • 36. Chapter 1 Ω(θ) as Ω(θ) = P+(θ)Λ+(θ)P+(θ) + P0(θ)Λ0(θ)P0(θ) Performing a similar decomposition for ˆΩ(θ) ˆΩ(θ) = ˆP+(θ)ˆΛ+(θ) ˆP+(θ) + ˆP0(θ)ˆΛ0(θ) ˆP0(θ) Where ˆP+(θ) is an (m − ¯m(θ)) × (m − ¯m(θ)) matrix of sample eigenvector estimates of P+(θ) with corresponding sample eigenvalue ˆΛ+(θ). ˆP0(θ) and ˆΛ0(θ) are similarly the sample estimates of P0(θ) and Λ0(θ) respectively let- ting ˆP(θ) := ( ˆP+(θ), ˆP0(θ)). Define Ω = Ω(θ0) and ˆΩ = ˆΩ(θ0) and ¯m(θ0) := ¯m for notational simplic- ity throughout and let the eigenvalues/vector matrices of both Ω and ˆΩ be defined without θ0, for example P := P(θ0), ˆP := ˆP(θ0) and so on. 1.3.1 Asymptotic Eigensystem Expansions Borrowing results from the Matrix Perturbation literature second order ex- pansions of the eigenvectors of ˆΩ(θn) are derived, Hassani et al. (2011). Using this result second order expansions of the eigenvalues around θ0 are established. These results for the sample moment variance matrix are new in the literature and of interest in their own right. Assumption 1 (A1): General Eigensystem Expansions (i) c ≤ [Λ+]jj ≤ K for some 0 < c ≤ K < ∞ ∀ j = {1, .., ¯m}, (ii) ||ˆΩ(θ) − ˆΩ(θ∗ )|| ≤ ˆM||θ − θ∗ || ∀θ, θ∗ ∈ Θ for some ˆM = Op(1), (iii) m < ∞ A1(i) is a relatively trivial condition which assumes the non-zero eigenvalues are well separated from zero and bounded. A2(ii) requires an asymptotic Lipschitz condition on the sample variance matrix. A3(iii) is an assump- tion of a finite number of moments which is made for simplicity, all results could readily be extended to allow m → ∞ with appropriate rate restrictions relative to n. 14
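The sample decomposition above can be computed directly from the moment contributions at a given θ; the R sketch below splits the eigensystem of ˆΩ(θ) into the (P+, Λ+) and (P0, Λ0) blocks. The numerical threshold tol used to classify eigenvalues as zero is an illustrative device and not part of the formal definitions.

# Sketch: eigendecomposition of the sample moment variance into the blocks
# (P_plus, Lambda_plus) and (P_0, Lambda_0).
# 'gmat' is the n x m matrix of moment contributions g(w_i, theta).
eig_split <- function(gmat, tol = 1e-10) {
  Om   <- crossprod(gmat) / nrow(gmat)            # \hat{Omega}(theta)
  eo   <- eigen(Om, symmetric = TRUE)             # eigenvalues in decreasing order
  zero <- eo$values < tol * max(eo$values, tol)
  list(P_plus      = eo$vectors[, !zero, drop = FALSE],
       Lambda_plus = eo$values[!zero],
       P_0         = eo$vectors[, zero,  drop = FALSE],
       Lambda_0    = eo$values[zero])
}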
  • 37. Identification Robust Inference with Singular Variance Define Ω+ = P+Λ+P+ and Ω∗ + = P+Λ−1 + P+ Theorem 1 (T1): General Eigensystem expansions of Under A1,A2 ˆP+(θn) = P+ + Op(||ˆΩ − Ω|| ∧ ||∆n||) (1.5) ˆΛ+(θn) = Λ+ + Op(||ˆΩ − Ω|| ∧ ||∆n||) (1.6) ˆP0(θn) = P0 − Ω∗ + ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)2 ) (1.7) ˆΛ0(θn) = P0 ˆΩ(θn)P0 − P0 ˆΩ(θn)Ω∗ + ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)3 ) (1.8) Second order expansions for the eigenvectors/values corresponding to non- zero eigenvalues are also provided in Lemma A2. As shown in Section 1.4 second order terms in ˆΛ+(θn), ˆP+(θn) do not enter first order asymptotics for ˆTGAR(θn) these results are omitted here for brevity. Theorem 2 provides expansions of the eigensystem of ˆΩ(θn) around θ0 under an i.i.d assumption on wi with corresponding regularity conditions. Assumption 2 (A2) : i.i.d Eigensystem Expansions (i) wi(i = 1, .., n) is an i.i.d sequence, (ii) E[||gi||2 ] < ∞,(iii) 1 n n i=1 ||Gi(θ)− Gi(θ∗ )|| ≤ ˆM||θ − θ∗ || ∀θ, θ∗ ∈ Θ where ˆM = Op(1), (iv) E[||Gi||2 ] < ∞, (v)E[gi(θ)] = 0 at θ = θ0. A2(i) is made largely for simplicity, all results could be extended to allow for dependence and heteroscedasticity under further regularity conditions. A2(iii) requires that for n large enough the average of any elements of Gi(θ) is sufficiently continuous. This is a weaker condition than Gi(θ) is continuous, though a sufficient condition for A2(ii) is that Gi(·) satisfies the Lipschitz condition. A2 (ii), (iv) are both required such that the remainder terms in the eigensystem expansions are bounded. For any arbitrary sequence ∆n where ||∆n|| > 0, ||∆n|| = op(n−1/2 ) define ¯∆n = ||∆n||−1 ∆n where ¯∆n p → ∆ where ||∆|| > 0 and is bounded. Define 15
  • 38. Chapter 1 gi := gi(θ0), Gi := Gi(θ0) and the following4 Γ = P0E[Gi∆∆ Gi]P0,Ψ = P0E[Gi∆gi], Φ := Γ − ΨΩ∗ +Ψ . Theorem 2 (T2): i.i.d Eigensystem Expansions Under A1, A2 ˆP+(θn) p → P+ (1.9) ˆΛ+(θn) p → Λ+ (1.10) ||∆n||−1 ( ˆP0(θn) − P0) p → Ω∗ +Ψ (1.11) ||∆n||−2 ˆΛ0(θn) p → Φ (1.12) 1.4 Generalized Anderson Rubin Statistic with Singular Variance This section derives conditions under which the GAR statistic has a χ2 m limit distribution making no assumption on the form of singularity. The GAR statistic ˆTGAR(θ) does not exist at θ = θ0 when Ω is singular. However when Φ is full rank then the GAR statistic exists (w.p.1) since ||∆n||−2 ˆΛ0(θn) = Φ + op(1) by Theorem 2 where (||∆n||−2 ˆΛ0(θn))−1 needs to exist for TGAR(θn) to exist (w.p.1). 5 . Assumption 3 (A3) : Limit Distribution of GAR Statistic (i) Null(Ω) ⊆ Null(GG ), (ii) Φ is p.d. A3(i) is a crucial condition needed for the GAR statistic to have the stan- dard χ2 m limit distribution. Note that this assumption always holds for NLS 4 For simplicity w omit dependence of Γ, Ψ on the arbitrary limit ∆. 5 Note that this is not an assumption that the true parameter is a sequence converging to θ0 at some rate, merely that we are evaluating the distribution of TGAR(θ) at points arbitrarily close to θ0. Using these results the true parameter could be modeled as some sequence converging to a limit θ0 which is commonly used to model certain forms of weak- identification in the literature, for example Stock and Wright (2000), Andrews and Cheng (2012). 16
where J = 1 by the results in Section 1.2.1. When A3(i) is violated the GAR statistic is in general Op(n), as shown in Theorem 4. This is termed the 'moment-singularity bias' in Section 1.4.2. A3(ii) is required for ˆTGAR(θn) to exist w.p.1 when the function does not exist at θ0 due to a singularity in Ω. Φ = P0'(E[Gi∆∆'Gi'] − E[Gi∆gi']Ω∗+E[gi∆'Gi'])P0 is p.s.d and in general will not be p.d unless P0'gi(θ) ≠ 0 at θ = θn. Note that by definition θn is a perturbation to each element of θ0. Singular variance usually occurs at a point: if Ω(θ0) is singular, Ω(θn), where θn perturbs every element of θ0, will typically be non-singular. For example, in y = β0 x^{γ0} + ε singularity in the moment variance occurs at β0 = 0. At β0 = 0 the moment variance is singular for all γ ∈ R; however, if we perturb β0 by some small amount the moment variance is non-singular at this point for any γ. It is difficult to think of examples where singularity exists on a set within B with volume greater than zero. Examples 1, 2 and 3 all have singular variance occurring at a point θ0 where the variance is non-singular at some perturbation away from θ0.

Theorem 3 (T3): Under A1, A2, A3,

\hat{T}_{GAR}(\theta_n) \overset{d}{\to} \chi^2_m. (1.13)

Remarks

(i) Note that in the standard case, where Ω is assumed to be non-singular, A2(iii), (iv) and A3 are not made. In this case all that is required to establish (1.13) is √n ˆg(θn) →d N(0, Ω), which holds under A2(i),(ii), and that ˆΩ(θ) is (asymptotically) continuous around θ0, which follows from A1(ii). It is then straightforward to show that ˆTGAR(θn) →d χ²_m.

(ii) When Ω is singular, second-order terms in the eigensystem expansions of ˆΩ(θn)−1 enter first-order asymptotics. As such, second-order terms in √n ˆg(θn) impact first-order asymptotics, requiring further regularity conditions on the first-order derivative.

(iii) Though theoretically ˆTGAR(θn) is asymptotically χ²_m for θn arbitrarily close but not equal (element by element) to θ0, in practice a regularization may need to be used. When Ω is singular then ˆΩ(θn) has its smallest eigenvalues
of order Op(||∆n||²). Take ∆ = c/n for c ≠ 0; then for large sample sizes, numerical software rounds numbers of order 10^{−x}, for x over a certain threshold, down to zero. In this case the GAR statistic evaluated using numerical software returns a warning that the statistic does not exist. A GAR statistic with a regularised estimate of the variance ˆΩ∗(θ), which drops eigenvalues below some vanishing threshold, would overcome this problem if the threshold is selected large enough not to encounter the precision error.

Regularised GAR Statistic (rGAR)

In order to overcome the practical issue of the GAR statistic not existing due to rounding imprecision of infinitesimally small numbers, a regularised statistic may be preferred in practice. Define J = {1, .., ¯m}, i.e. those j such that λj > 0, and Jc the complement of J. Given ||∆n|| = op(n−1/2), by Theorem 2 nˆλj(θn) = op(1) for j ∈ Jc and nˆλj(θn) →p ∞ for j ∈ J. We can then estimate J w.p.a.1 since Pr{nˆλj(θn) > K} → 1 for all j ∈ J and Pr{nˆλj(θn) > K} → 0 for all j ∈ Jc. In practice the practitioner sets the threshold K and estimates J by ˆJ := {j ∈ (1, .., m) : nˆλj(θn) > K}; it is then straightforward to show by the results above that Pr{ˆJ = J} → 1. We can then regularise the variance matrix,

\hat{\Omega}^{*}(\theta_n) = \sum_{j=1}^{|\hat{J}|} \hat{\lambda}_j(\theta_n)\,\hat{P}_j(\theta_n)\hat{P}_j(\theta_n)',

where |ˆJ| is the dimension of ˆJ, which is our estimate of ¯m, the rank of Ω based on the regularisation, and define

T^{*}_{GAR}(\theta_n) = n\,\hat{g}(\theta_n)'\,\hat{\Omega}^{*}(\theta_n)^{-1}\hat{g}(\theta_n). (1.14)

Here n^{1/2} ˆg(θn) →d N(0, Ω) under A1, A2 and, given Pr{ˆJ = J} → 1, ˆΩ∗(θn) equals the sum over the ¯m terms with j ∈ J w.p.a.1; hence ˆΩ∗(θn) is full rank w.p.a.1 since ˆλj > 0 for all j ∈ J. Then ˆΩ∗(θn) →p Ω and n ˆg(θn)'ˆΩ∗(θn)−1ˆg(θn) →d χ²_¯m. Inference can then be based on the rGAR statistic, where now the limit distribution is χ²_¯m rather than χ²_m.

Theorem 3 is confirmed in a simulation based on the Heckman Selection example in Section 1.4.1. In this case the crucial assumption that Null(Ω) ⊆ Null(GG') holds, as this is NLS with J = 1. The GAR statistic when Ω is near singular returns NaN (not a number) some of the time in light
  • 41. Identification Robust Inference with Singular Variance of the imprecision of numerical software. As such inference is also provided based on the rGAR where it is confirmed that T∗ GAR(θn) is asymptotically χ2 ¯m where ¯m may be estimated w.p.1 as n tends towards infinite using |ˆJ|. 1.4.1 Simulation : Heckman Selection Consider the setup in Example 1 where y = θ1 + θ2x + θ3 φ(θ4 + θ5x) Φ(−(θ4 + θ5x)) + Where (x, e) are i.i.d and x ∼ N(0, 8)2 and |x ∼ N(0, 1). Setting θ0 = (1, 1, 0.2, 0.1, κ) where θn = (1, 1, 0.2, 0.1, κ + 1/n) for κ = {0, 0.5, 1}, N = {100, 500, 1000, 5000, 50000}. For κ close to zero NLS is poorly identified as the Inverse Mills Ratio is approximately linear for arguments less than 2, Puhani (2000). Rejection probabilities for the event that the GAR function at θn is less than the 90% quantile of a χ2 5 based on R = 10000 simulations are calculated. In brackets the percentage of warnings returned (no number reported indicates no warn- ings returned) in R when calculating the inverse of the variance matrix are reported. When κ = 0 and hence moments are completely unidentified with exactly singular variance then though GAR exists at θ0 + 1/n , though the smallest eigenvalue is Op(|∆n||2 ) and due rounding approximations used by computational software may yield zero eigenvalues in practise. The rejection frequency depends crucially on the variation in x. If x has small variability then Ω is close to rank 3 and at κ = 0 when perturbed by 1/n then the GAR statistic evaluated in numerical software returns NaN the majority of the time. This is because in this example the rank of Ω is very close to 3 given φ(θ4+θ5x) Φ(−(θ4+θ5x)) ≈ θ4 + θ5x unless x has high variability. Using R software using only double bit precision the GAR statistic returns a warning message quite a high proportion of times. This example considered x with a high variation, where at κ = 0 then Ω is rank 4 and a NaN is still encountered at a high frequency for n large as evidenced in Table 1.1 below. To combat this a regularisation would be required in practise. Table 2 provides inference 19
based on the rGAR statistic with K = 0.001. This negates the NaN issue and coverage is still asymptotically correct. In practice, especially in high-dimensional moment problems with many zero or almost-zero eigenvalues in Ω, a regularisation will be necessary.

Table 1.1: GAR Rejection Probabilities: Heckman Selection

            κ = 0           κ = 0.5   κ = 1
n = 100     0.088           0.074     0.071
n = 500     0.087           0.092     0.093
n = 1000    0.09            0.096     0.095
n = 5000    0.088 (2%)      0.098     0.095
n = 50000   0.092 (41.3%)   0.92      0.96

As seen in Table 1.1, for n large θn is close to the point of singularity and the NaN warning is returned with increasingly high frequency. One method to overcome this in smaller-dimensional problems would be to use high-precision arithmetic to avoid the rounding issue. Table 1.2 repeats the analysis of Table 1.1 but now based on the rGAR statistic as outlined above.

Table 1.2: rGAR Rejection Probabilities: Heckman Selection

            κ = 0.05   κ = 0.5   κ = 1
n = 100     0.088      0.083     0.075
n = 500     0.098      0.094     0.099
n = 1000    0.096      0.097     0.098
n = 5000    0.101      0.096     0.099
n = 50000   0.097      0.095     0.096
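A minimal R sketch of this Monte Carlo design is given below. The distribution assumed for x (mean-zero normal with standard deviation 8), the handling of numerical failures and the helper names are assumptions of the sketch rather than the exact implementation behind Tables 1.1 and 1.2.

# Sketch of the Heckman selection design: NLS moments and the GAR statistic
# at theta_n = theta_0 + 1/n.  Assumes x ~ N(0, 8^2); details of the original
# experiment may differ.
mills  <- function(v) dnorm(v) / pnorm(-v)                  # inverse Mills ratio phi/Phi(-.)
dmills <- function(v) mills(v) * (mills(v) - v)             # its derivative

heckman_gar <- function(n, theta0, theta_eval) {
  x   <- rnorm(n, sd = 8)
  y   <- theta0[1] + theta0[2] * x + theta0[3] * mills(theta0[4] + theta0[5] * x) + rnorm(n)
  v   <- theta_eval[4] + theta_eval[5] * x
  eps <- y - (theta_eval[1] + theta_eval[2] * x + theta_eval[3] * mills(v))
  D   <- cbind(1, x, mills(v), theta_eval[3] * dmills(v), theta_eval[3] * dmills(v) * x)
  g   <- eps * D                                            # n x 5 NLS moment contributions
  gb  <- colMeans(g); Om <- crossprod(g) / n
  as.numeric(n * t(gb) %*% solve(Om) %*% gb)                # may fail when Om is near singular
}

# Rejection frequency at kappa = 1, n = 500 (compare with the nominal 0.1):
# theta0 <- c(1, 1, 0.2, 0.1, 1)
# mean(replicate(2000, heckman_gar(500, theta0, theta0 + 1/500) > qchisq(0.9, 5)))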
  • 43. Identification Robust Inference with Singular Variance 1.4.2 Moment-Singularity Bias when Null(Ω) Null(GG ) A3(i) is critical in the proof of Theorem 3. When this condition is violated- with examples given in Section 1.2.2 in general ˆTGAR(θn) is unbounded in probability. Theorem 4 (T4) : Under A1, A2, A3(ii) ˆTGAR(θn)/n p → ∆ G P0Φ−1 P0G∆ (1.15) Where ∆ G P0Φ−1 P0G∆ > 0 since Φ is full rank by A3(ii). Hence the GAR statistic is Op(n) when A3(ii) is violated. When A3(i) is almost violated the GAR statistic is shown in the simulation below to be potentially very oversized even for large sample sizes. Theorem 4 is particularly striking as it implies there exist cases of correctly specified moments which strongly identify θ0 where identification robust in- ference based on the GAR statistic would (asymptotically) yield the empty set. This would usually regarded as a sign of moment misspecification. Simulation : Linear IV Simultaneous Equations Consider Example 3 where y1 = x1 + 1 y2 = 0.5x2 + 2 x1 = ¯π(1 + z) + η1 x2 = −¯π(1 + z2 ) + η2 η1 = υ1 exp(−ζ1z), η2 = υ2 exp(−ζ2z) υ1 = 1 + ρ 2 ζ1 + 1 − ρ 2 ζ2, υ2 = 1 + ρ 2 ζ1 − 1 − ρ 2 ζ2 21
  • 44. Chapter 1 (υ1, υ2, 1, 2) |z i.i.d ∼ N(04, Ξ) Ξ =       1 0 0.3 0 0 1 0.5 0 0.3 0 1 0 0 0.5 0 1       For each ¯π = {0, 0.1, 0.5} (uncorrelated, weak, strong) instruments the fol- lowing simulation is performed. For instrument sets I1 = {1, z}, I2 = {1, z, z2 } , I3 = {1, z, z2 , z3 } which respectively yield m = {4, 6, 8} mo- ments rejection probabilities are formulated for the GAR statistic based on a the 0.9 quantile of the relevant χ2 m based on 5000 repetitions where θn = (1, 1/2) + 1/n for z i.i.d ∼ N(0, 1) n = {100, 500, 1000, 5000, 50000}, ρ = {0.9995, 0.999995, 1} (ζ1, ζ2) = {(0, 0), (0, 0.5), (0, 1)}. When ¯π = 0 the condition Null(Ω) ⊆ Null(GG ) is automatically satisfied, in which case the GAR statistic should have a rejection probability around 0.1 for large sample sizes and is verified in Table 1.2. For brevity only the case ζ1 = ζ2 = 0 is reported, similar results were found for both other cases. When ¯π = 0 then when Ω is singular in directions G does not vanish the GAR statistic is in general oversized (i) When ρ = 1 and ζ1 = ζ2 = 0 then Ω is singular as shown in Example 3 δ Ω = 0 implies δ G = 0 if and only if ¯π = 0. The stronger the instruments (the larger is ¯π) the more oversized the rejection probability for any m. (ii) When ρ = 1 and ζ1 = ζ2 then Ω approaches a singular matrix as m increases. Fixing ζ1 = 0 and let ζ2 equal 0.5 and 1. The larger is ζ2 the less well that any m polynomials of z can approximate exp(ζ2/2z) (i.e h2(z)−1/2 from notation in Example 3). The GAR rejection probability is decreasing in ζ2 for any given m, ¯π and increasing in both m and ¯π. (iii) When ρ < 1 then Ω is full rank, however the closer ρ is to 1 in general the larger the GAR statistic as ¯π increases. Even for large sample sizes the rejection probabilities can be very close to 1. Table 1.3 shows the rejection probabilities for the weak instrument case. As expected when ρ = 1 and ζ1 = ζ2 = 0 the rejection probabilities converge to 1 as n increases (since GAR is unbounded in this case for any m). For ρ = 0.999995 and 0.9995 the rejection probabilities for any n,m are smaller then when ρ = 1 however still oversized in small samples. 22
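A sketch of this experiment in R is given below. The function gar_sys forms the stacked moments ρ(θ) ⊗ Z for polynomial instrument sets and evaluates the GAR statistic. The toy data-generating step that follows is a simplified, homoscedastic stand-in with correlation ρ between the two structural errors; it is an assumption of the sketch and not the exact design behind Tables 1.3 to 1.5.

# Sketch: GAR statistic for the two-equation IV system of Example 3 with
# instruments (1, z, ..., z^deg); m = 2*(deg + 1) stacked moments.
gar_sys <- function(y1, y2, x1, x2, z, theta, deg = 1) {
  Z    <- sapply(0:deg, function(k) z^k)
  rho1 <- y1 - theta[1] * x1
  rho2 <- y2 - theta[2] * x2
  g    <- cbind(Z * rho1, Z * rho2)
  gb   <- colMeans(g); Om <- crossprod(g) / length(z)
  as.numeric(length(z) * t(gb) %*% solve(Om) %*% gb)
}

# Simplified stand-in DGP (an assumption of this sketch): homoscedastic
# structural errors with correlation rho, endogeneity added illustratively.
set.seed(1); n <- 1000; pibar <- 0.5; rho <- 0.9995
z  <- rnorm(n); u1 <- rnorm(n); u2 <- rnorm(n)
e1 <- u1
e2 <- rho * u1 + sqrt(1 - rho^2) * u2              # nearly perfectly correlated with e1
x1 <- pibar * (1 + z)    + 0.3 * e1 + rnorm(n)
x2 <- -pibar * (1 + z^2) + 0.5 * e2 + rnorm(n)
y1 <- x1 + e1
y2 <- 0.5 * x2 + e2
gar_sys(y1, y2, x1, x2, z, c(1, 0.5) + 1/n, deg = 1)   # compare with qchisq(0.9, 4)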
Table 1.3: GAR Rejection Probabilities, ¯π = 0

                ρ = 0.9995                ρ = 0.99995               ρ = 1
             m = 4   m = 6   m = 8     m = 4   m = 6   m = 8     m = 4   m = 6   m = 8
ζ1 = ζ2 = 0
n = 100      0.099   0.080   0.074     0.090   0.092   0.077     0.099   0.904   0.074
n = 500      0.099   0.099   0.097     0.095   0.093   0.087     0.101   0.094   0.084
n = 1000     0.010   0.102   0.0891    0.097   0.103   0.096     0.098   0.094   0.09
n = 5000     0.098   0.093   0.103     0.093   0.106   0.104     0.101   0.097   0.099
n = 50000    0.010   0.102   0.096     0.098   0.101   0.098     0.102   0.091   0.102

As ζ2 increases the rejection probabilities in general decrease for any ρ, since for any given m the instrument set less well approximates the null space of Ω(z, θ0). As m increases the rejection probabilities increase. This pattern is again observed in Table 1.4 for strong instruments. In this case the rejection probabilities for any given n, m, ρ, ζ2 are in general relatively more oversized than when ¯π = 0.1.
Table 1.4: GAR Rejection Probabilities, ¯π = 0.1

                ρ = 0.9995                ρ = 0.999995              ρ = 1
             m = 4   m = 6   m = 8     m = 4   m = 6   m = 8     m = 4   m = 6   m = 8
ζ1 = ζ2 = 0
n = 100      0.135   0.123   0.198     0.428   0.38    0.8       0.492   0.421   0.867
n = 500      0.11    0.114   0.132     0.727   0.724   0.998     0.995   0.996   1
n = 1000     0.106   0.1     0.12      0.628   0.6412  0.992     1       1       1
n = 5000     0.091   0.103   0.095     0.251   0.253   0.599     1       1       1
n = 50000    0.092   0.1     0.108     0.117   0.11    0.15      1       1       1
ζ1 = 0, ζ2 = 0.5
n = 100      0.117   0.118   0.36      0.204   0.292   0.8       0.218   0.329   0.85
n = 500      0.102   0.107   0.267     0.124   0.542   1         0.119   0.954   1
n = 1000     0.106   0.103   0.194     0.105   0.461   1         0.104   1       1
n = 5000     0.105   0.098   0.109     0.106   0.196   0.986     0.094   1       1
n = 50000    0.103   0.105   0.103     0.099   0.107   0.278     0.095   0.676   1
ζ1 = 0, ζ2 = 1
n = 100      0.080   0.107   0.521     0.089   0.234   0.739     0.076   0.263   0.764
n = 500      0.094   0.099   0.623     0.086   0.247   1         0.094   0.314   1
n = 1000     0.087   0.099   0.42      0.099   0.162   1         0.095   0.199   1
n = 5000     0.098   0.088   0.150     0.093   0.102   0.972     0.102   0.096   1
n = 50000    0.101   0.096   0.095     0.099   0.095   0.230     0.104   0.098   1
Table 1.5: GAR Rejection Probabilities, ¯π = 0.5

                ρ = 0.9995                ρ = 0.999995              ρ = 1
             m = 4   m = 6   m = 8     m = 4   m = 6   m = 8     m = 4   m = 6   m = 8
ζ1 = ζ2 = 0
n = 100      0.927   0.893   0.999     1       1       1         1       1       1
n = 500      0.495   0.477   0.939     1       1       1         1       1       1
n = 1000     0.317   0.286   0.706     1       1       1         1       1       1
n = 5000     0.145   0.144   0.222     1       1       1         1       1       1
n = 50000    0.104   0.102   0.110     0.530   0.544   0.964     1       1       1
ζ1 = 0, ζ2 = 0.5
n = 100      0.761   0.895   1         0.988   1       1         0.992   1       1
n = 500      0.292   0.480   1         0.690   1       1         0.713   1       1
n = 1000     0.193   0.283   1         0.404   1       1         0.425   1       1
n = 5000     0.106   0.125   0.642     0.148   1       1         0.148   1       1
n = 50000    0.100   0.106   0.150     0.102   0.340   1         0.107   1       1
ζ1 = 0, ζ2 = 1
n = 100      0.171   0.707   1         0.200   1       1         0.194   0.996   1
n = 500      0.101   0.277   1         0.097   0.996   1         0.089   0.955   1
n = 1000     0.096   0.182   1         0.091   0.936   1         0.098   0.349   1
n = 5000     0.088   0.109   0.978     0.094   0.306   1         0.092   0.102   1
n = 50000    0.095   0.105   0.220     0.102   0.108   1         0.091   0.102   1

1.5 Conclusion

This chapter studies identification robust inference based on the GAR statistic with general forms of identification failure. As demonstrated, the non-singular variance assumption is inextricably linked to the assumption of first-order identification. This issue has largely been overlooked in the identification literature. A notable exception is Andrews and Cheng (2012), who deal
with the singular variance arising from identification failure under an assumption that the form of the singular variance is known up to model parameters.

In order to study properties of the GAR statistic with singular variance, second-order expansions of the eigensystem of the moment variance matrix around the true parameter were derived. This asymptotic approach is new in the identification literature and will prove useful for extending results for other identification robust statistics.

Without making any identification assumptions (and hence allowing for general forms of singular variance) the GAR statistic is asymptotically χ²_m under a further set of conditions. Crucially, one condition requires that the null space of the moment variance matrix lie within that of the outer product of the expected first-order derivative matrix. When this assumption is violated the GAR statistic is unbounded; in this case confidence sets based on inverting the GAR statistic would asymptotically yield the empty set. This result is unknown in the literature and is termed the 'moment-singularity bias'.

Examples of how this condition could be violated are provided. Roughly speaking, this problem can occur when moments are not weakly identified and are perfectly correlated at the true parameter. This chapter models moments as exactly singular; an interesting extension would model moments as weakly singular, namely modelling the smallest eigenvalues as shrinking to zero at some rate, analogous to the weak-instrument methodology for modelling weak identification. Simulation evidence shows that when the condition on the null spaces of Ω and GG' is almost not satisfied the GAR statistic is in general oversized.

The majority of the literature on properties of estimators and identification robust inference makes the assumption that moments have non-singular variance, or singular variance of known form. This chapter is a first step in providing a platform to extend results in other settings without making a non-singular variance assumption, or assumptions on the form of the singularity as in Andrews & Cheng (2012). Examples include dropping the non-singular variance assumption for identification robust inference from the GEL objective function made in Guggenberger, Ramalho & Smith (2008).
  • 49. Identification Robust Inference with Singular Variance 1.6 Appendix Appendix A1: Auxiliary Lemmas Lemma A1: w.p.1 ˆΛ0 = 0 Proof of Lemma A1: P0ΩP0 = 0 by definition of P0. E[P0gigiP0] = 0 Since P0gigiP0 is p.d then P0gi = 0 a.s(z) Hence P0 ˆΩP0 = 1 n n i=1 P0gigiP0 = 0 So ˆΩP0 = 0 w.p.1 then ˆP0 = P0H w.p.1 for some full rank ¯m × ¯m matrix H since ˆΩ ˆP0 = 0 by definition, hence ˆΛ0 = 0 Q.E.D Lemma A2: Let ˆA and A be two square symmetric matrices of dimension r where Rank(A) = ¯r and || ˆA − A|| = Op( n) for some bounded non-negative sequence n. Eigen-decompose A = RDR where RR = Ir×r and RDR = R+D+R+ +R0D0R0 where D0 = 0¯rׯr and D+ is a full rank diagonal (r−¯r)× (r−¯r) matrix with the eigenvalues of A on the diagonal where 0 ≤ ||D+|| ≤ K for K < ∞. Similarly express ˆA = ˆR ˆD ˆR = ˆR+ ˆD+ ˆR+ + ˆR0 ˆD0 ˆR0. Define B = ˆA − A then it is true that, ˆR+ = R+ − R0R0B R+D−1 + + Op( 2 n) ˆR0 = R0 − R+D−1 + R+BR0 + Op( 2 n) Proof of Lemma A2: This result follows from equations (8),(9) in Hassani 27
  • 50. Chapter 1 et al. (2011) . Q.E.D Also note that By CS||R0R0B R+D−1 + || ≤ ||R0||2 Op(||D+||)Op(||B||) = Op( n) since ||D+|| = O(1) then ˆR+ = R+ + Op( n) follows from Lemma A2 which is used in the proofs of Theorem 1-4. Lemma A3: Under A1,A2 ||∆n||−2 P0 ˆΩ(θn)P0 p → Γ Proof of Lemma A3: ˆΩ(θn) = 1 n n i=1 gi(θn)gi(θn) Taylor expand gi(θn) around θ0 gi(θn) = gi + Gi(¯θn)∆n (1.16) Where ¯θn is a vector between θ0 and θn Define ¯Gi := Gi(¯θn) ˆΩ(θn) = ˆΩ + 1 n n i=1 ¯Gi∆n∆n ¯Gi + 1 n n i=1 gi∆n ¯Gi + 1 n n i=1 ¯Gi∆ngi (1.17) By Lemma A1(i) Pr{P0gi(θ0) = 0} = 1 so that w.p.1 P0 ˆΩ(θn)P0 = 1 n n i=1 P0 ¯Gi∆n∆n ¯GiP0 (1.18) = 1 n n i=1 P0(( ¯Gi−Gi)∆n∆n ¯Gi+Gi∆n∆n( ¯Gi−Gi) )P0+ 1 n n i=1 P0Gi∆n∆nGiP0 28
  • 51. Identification Robust Inference with Singular Variance By repeated application of CS || 1 n n i=1 P0( ¯Gi − Gi)∆n∆nGiP0|| ≤ ||∆n||2 ||||P0||2 1 n n i=1 ||Gi( ¯Gi − Gi)|| ≤ ||∆n||2 ||||P0||2 1 n n i=1 ||Gi||1 n n i=1 || ¯Gi − Gi|| By A2(iii) 1 n n i=1 || ¯Gi − Gi|| = Op(||∆n||) and 1 n n i=1 ||Gi|| = Op(1) by A2 (i),(iv). Since ||P0|| = ¯m < ∞ by A1(iii) || 1 n n i=1 P0(( ¯Gi − Gi)∆n∆nGiP0|| = Op(||∆n||3 ) (1.19) Similarly it can be shown that ||1 n n i=1 P0(( ¯Gi −Gi)∆n∆n ¯GiP0|| = Op(||∆n||3 ) Define ˆΓn = P0 1 n n i=1 Gi ¯∆n ¯∆nGiP0, Γn = P0 1 n n i=1 E[Gi ¯∆n ¯∆nGi]P0, Then by (??) and (1.19) substituted in to (1.18) implies ||∆n||−2 P0 ˆΩ(θn)P0 = ˆΓn + Op(||∆n||) (1.20) Finally to show ˆΓn p → Γ establishing the result As E[ˆΓn] = Γn and by application of CS ||Γn|| ≤ || ¯∆n||2 E[||Gi||2 ] = O(1) (1.21) Where ¯∆n = O(1) and by A2(iv) E[||Gi||2 ] = O(1) Under A2(i) wi(i = 1, .., n) is i.i.d and E[ˆΓn] = Γn → Γ by CMT (since ¯∆n ¯∆n → ∆∆ and Γn is a continuous function of the bounded sequence ¯∆n). An application of the Khinctine Weak Law of Large of Numbers (KWLLN) element by element to ˆΓn then ˆΓn p → Γ and by (1.20) noting that ||∆n|| = op(n−1/2 ) establishes the result. Q.E.D 29
  • 52. Chapter 1 Lemma A4: Under A1, A2 ||∆n||−1 P0 ˆΩ(θn) p → Ψ Proof of Lemma 4: By Lemma A1(i) and (1.17) P0 ˆΩ(θn) = P0 n i=1 ¯Gi∆ngi + P0 1 n n i=1 ¯Gi∆n∆n ¯Gi (1.22) Where ||P0 1 n n i=1 ¯Gi∆n∆n ¯Gi|| = Op(||∆n||2 ) as shown in the proof of Lemma A3 as ||Γ|| = O(1) by (1.21) P0 ˆΩ(θn) = P0 n i=1 ¯Gi∆ngi + Op(||∆n||2 ) (1.23) By CS, ||P0 1 n n i=1 ( ¯Gi − Gi)∆ngi|| ≤ ||P0|| 1 n n i=1 || ¯Gi − Gi||||∆n|||| 1 n n i=1 ||gi|| (1.24) Where 1 n n i=1 ||gi|| = Op(1) by KWLLN under A2(i) and A2(ii) that E[||gi||2 ] = O(1) and 1 n n i=1 || ¯Gi − Gi|| = Op(||∆n||) by A2(iii) so that ||P0 1 n n i=1( ¯Gi − Gi)∆ngi|| = Op(||∆n||2 ). Define ˆΨn := P0 1 n n i=1 Gi ¯∆ngi, Ψn = P0E[Gi ¯∆ngi] then by (1.24) ||∆n||−1 P0 1 n n i=1 Gi∆ngi = ˆΨn + Op(||∆n||) (1.25) Since E[ˆΨn] = Ψn where Ψn is bounded for all n since by CS ||Ψn|| ≤ || ¯∆n||E[||Gi||]E[||gi||] (1.26) Where E[||Gi||] = O(1) E[||gi||] = O(1) by A2 (ii),(iv). By the KWLLN ˆΨn p → Ψn where || ¯∆n|| = O(1) where Ψn → Ψ by CMT establishing the 30
  • 53. Identification Robust Inference with Singular Variance result. Q.E.D Appendix A2: Main Theorems Proof of Theorem 1: Define the following from Lemma A2 ˆA = ˆΩ(θn), A = Ω where B = ˆΩ(θn) − Ω and ||ˆΩ(θn) − Ω|| ≤ ||ˆΩ(θn) − ˆΩ|| + ||ˆΩ−Ω|| by T , ||ˆΩ(θn)− ˆΩ|| = Op(||∆n||) by A1(ii)so that n := ||ˆΩ−Ω||∧||∆n|| where R+ = P+, R0 = P0, ˆR+ = ˆP+(θn), ˆR0 = ˆP0(θn) and D+ = Λ+ then Since ||Λ−1 + ||||P0||||P+||||ˆΩ(θn)− ˆΩ|| = O(1)Op(||∆n||) since m = O(1) by A1 (iii) hence ||P0|| = ¯m = O(1) where 0 ≤ ¯m ≤ m and ||P+|| = m − ¯m = O(1) where ||Λ−1 + || = O(1) by A1(i). Then by Lemma A2 ˆP+(θn) = P+ + Op(||ˆΩ − Ω|| ∧ ||∆n||) (1.27) Establishing (1.5). ||ˆΛ(θn) − Λ|| ≤ ||ˆΩ(θn) − Ω|| (1.28) By Theorem 4.2 of Bosq (2000). Where it has been shown that ||ˆΩ(θn)−Ω|| = Op(||ˆΩ − Ω|| ∧ ||∆n||) establishing (1.6). Now to show (1.7) and (1.8) again using Lemma A2, ˆP0(θn) = P0 − Ω∗ + ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)2 ) (1.29) Establishing (1.7). ˆΛ0(θn) = ˆP0(θn) ˆΩ(θn) ˆP0(θn) (1.30) = ( ˆP0(θn) − P0) ˆΩ(θn)( ˆP0(θn) − P0) + P0 ˆΩ(θn)( ˆP0(θn) − P0) +( ˆP0(θn) − P0) ˆΩ(θn)P0 + P0 ˆΩ(θn)P0 31
  • 54. Chapter 1 Where by (1.7) ˆP0(θn) − P0 = −Ω∗ + ˆΩ(θn)P0 + Op(||∆n||2 ) Noting that Ω = Ω+ and by CS ||Ω∗ + ˆΩ(θn)P0|| ≤ ||Ω∗ +||||P0||||ˆΩ(θn) − ˆΩ(θ0|| = Op(||∆n||) since ||Ω∗ +|| = O(1) by A1(i) and P0 ˆΩ(θn) = P0(ˆΩ(θn) − ˆΩ(θ0)) by Lemma A1(i) so that, ( ˆP0(θn) − P0) ˆΩ(θn)( ˆP0(θn) − P0) (1.31) = P0 ˆΩ(θn)Ω∗ + ˆΩ(θn)P0 + Op(||∆n||3 ) P0 ˆΩ(θn)( ˆP0(θn) − P0) (1.32) = −P0 ˆΩ(θn)Ω∗ +(ˆΩ(θn)P0 + Op((||∆n|| ∧ ||ˆΩ − Ω||)3 ) Hence plugging (1.31),(1.32) in to(1.30) ˆΛ0(θn) = P0 ˆΩ(θn)P0 −P0 ˆΩ(θn)Ω∗ + ˆΩ(θn)P0 +Op((||∆n||∧||ˆΩ−Ω||)3 ) (1.33) Which establishes (1.8). Q.E.D Proof of Theorem 2: By (1.7) ˆP+ = P+ + Op(||∆n|| ∧ ||ˆΩ − Ω||) (1.34) ˆΛ+ = Λ+ + Op(||∆n|| ∧ ||ˆΩ − Ω||) (1.35) Where ||ˆΩ − Ω|| = Op(n−1/2 ) by A2(i),(ii) and ||∆n|| = op(n−1/2 ) establishing (1.9),(1.10). By T1 ||∆n||−1 ( ˆP0(θn) − P0) = −||∆n||−1 Ω∗ + ˆΩ(θn)P0 + op(n−1/2 ) (1.36) Since ||∆n||−1 Op((||∆n|| ∧ ||ˆΩ − Ω||)2 ) = op(n−1/2 ) since ||∆n|| = op(n−1/2 ) By 32
  • 55. Identification Robust Inference with Singular Variance the CMT and Lemma A3 ||∆n||−1 Ω∗ + ˆΩ(θn)P0 p → Ω∗ +Ψ establishing (1.11). By (1.8) ||∆||−2 ˆΛ0(θn) = ||∆n||−2 P0 ˆΩ(θ0)P0 (1.37) −||∆n||−2 P0 ˆΩ(θn)Ω∗ + ˆΩ(θn)P0 + op(n−1/2 ) Since ||∆n||−2 Op((||∆n||∧||ˆΩ−Ω||)3 ) = op(n−1/2 ) By Lemma A2 ||∆n||−2 P0 ˆΩ(θ0)P0 p → Γ and by Lemma A3 and CMT ||∆n||−2 P0 ˆΩ(θn)Ω∗ + ˆΩ(θn)P0 p → ΨΩ∗ +Ψ establishing (1.12). Q.E.D Proof of Theorem 3: ˆTGAR(θn) = n ˆP+(θn) ˆg(θn) ˆΛ+(θn)−1 ˆP+(θn)ˆg(θn) (1.38) +n ˆP0(θn) ˆg(θn) ˆΛ0(θn)−1 n ˆP0(θn) ˆg(θn) Using the expansion of ˆg(θn) around θ0 summed across i in (1.15) √ nˆg(θn) = √ nˆg(θ0) + √ n ˆG(¯θn)∆n (1.39) By repeated application of CS, || √ n( ˆG(¯θn)− ˆG(θ0))∆n|| ≤ √ n||∆n|| 1 n n i=1 || ¯Gi −Gi|| = Op(n1/2 ||∆n||2 ) (1.40) By A2 (ii) where ||∆n||2 n1/2 = op(n−1/2 ) hence √ nˆg(θn) = √ nˆg(θ0) + √ n ˆG(θ0)∆n + op(n−1/2 ) (1.41) Firstly establish that n( ˆP+(θn) ˆg(θn)) ˆΛ(θn)−1 ˆP+(θn) ˆg(θn) = n(P+ˆg(θ0)) Λ−1 + P+ˆg(θ0) + op(1) 33
By (1.9), P̂_+(θ_n) = P_+ + o_p(1), and by (1.41)

P̂_+(θ_n)' √n ĝ(θ_n) = P_+'(√n ĝ(θ_0) + Ĝ(θ_0) √n Δ_n) + o_p(1)    (1.43)
= P_+' √n ĝ(θ_0) + o_p(1)    (1.44)

since ||P_+' Ĝ(θ_0) √n Δ_n|| ≤ n^{1/2} ||P_+|| ||Ĝ(θ_0)|| ||Δ_n|| = n^{1/2} O(1) O_p(1) o_p(n^{-1/2}) = o_p(1). Λ̂_+(θ_n) = Λ_+ + o_p(1) by (1.10), and under A1(i) Λ_+^{-1} exists, so that by the CMT

Λ̂_+(θ_n)^{-1} = Λ_+^{-1} + o_p(1)    (1.45)

which together with (1.44) implies (1.42), so that n (P̂_+(θ_n)' ĝ(θ_n))' Λ̂_+(θ_n)^{-1} P̂_+(θ_n)' ĝ(θ_n) →_d χ²_{m−m̄}, since √n P_+' ĝ(θ_0) →_d N(0, Λ_+) by A2(i),(ii) and the Lindeberg–Lévy Central Limit Theorem.

We now go on to derive the limit distribution of n (P̂_0(θ_n)' ĝ(θ_n))' Λ̂_0(θ_n)^{-1} P̂_0(θ_n)' ĝ(θ_n). Under A1, A2, A3 it can be shown that

||Δ_n||^{-1} √n P̂_0(θ_n)' ĝ(θ_n) = P_0' √n (Ĝ(θ_0) − G) Δ̄_n − Ψ Ω*_+ √n ĝ(θ_0) + o_p(1)    (1.46)

By (1.11), ||Δ_n||^{-1}(P̂_0(θ_n) − P_0) = −Ω*_+ Ψ + o_p(1), so

||Δ_n||^{-1} √n P̂_0(θ_n)' ĝ(θ_n) = (−Ω*_+ Ψ + o_p(1)) √n ĝ(θ_n) + ||Δ_n||^{-1} P_0' √n ĝ(θ_n)    (1.47)

where by (1.39) √n ĝ(θ_n) = √n ĝ(θ_0) + o_p(1), hence (−Ω*_+ Ψ + o_p(1)) √n ĝ(θ_n) = −Ψ Ω*_+ √n ĝ(θ_0) + o_p(1). To establish the first part on the right-hand side of (1.46), note that

||Δ_n||^{-1} P_0' √n ĝ(θ_n) = P_0' √n (Ĝ(θ_n) − G) Δ̄_n + o_p(1)    (1.48)
since by Lemma A1(i) P_0' √n ĝ(θ_0) = 0 w.p.1 and by A3(i) P_0'G = 0. By (1.12),

||Δ_n||^{-2} Λ̂_0(θ_n) = Φ + o_p(1)    (1.49)

where Φ is p.d. by A3(ii). By the CMT and (1.49),

(||Δ_n||^{-2} Λ̂_0(θ_n))^{-1} = Φ^{-1} + o_p(1)    (1.50)

Together (1.46), (1.50) establish that w.p.a.1

n (P̂_0(θ_n)' ĝ(θ_n))' Λ̂_0(θ_n)^{-1} P̂_0(θ_n)' ĝ(θ_n) = (P_0'(√n (Ĝ(θ_0) − G) Δ̄_n − Ψ Ω*_+ √n ĝ(θ_0)))' Φ^{-1} (P_0'(√n (Ĝ(θ_0) − G) Δ̄_n − Ψ Ω*_+ √n ĝ(θ_0)))    (1.51)

Now it can be established that

P_0'(√n (Ĝ(θ_0) − G) Δ̄_n − Ψ Ω*_+ √n ĝ(θ_0)) →_d N(0, Φ)    (1.52)

Define Δ̄_n := Δ_n/||Δ_n|| and b_i := P_0'((G_i − G) Δ̄_n − Ψ Ω*_+ g_i). Then

P_0'(√n (Ĝ(θ_0) − G) Δ̄_n − Ψ Ω*_+ √n ĝ(θ_0)) = (1/√n) Σ_{i=1}^n b_i    (1.53)

where E[(1/√n) Σ_{i=1}^n b_i] = 0 and

E[(1/n) Σ_{i=1}^n b_i b_i'] = P_0' E[G_i Δ̄_n Δ̄_n' G_i'] P_0 − Ψ_n Ω*_+ Ω Ω*_+ Ψ_n'    (1.54)

By A1(i), w_i is i.i.d., and by definition Ψ_n = P_0' E[G_i Δ̄_n g_i'] → Ψ since Δ̄_n → Δ where ||Δ|| < ∞ (and likewise Γ_n := P_0' E[G_i Δ̄_n Δ̄_n' G_i'] P_0 → Γ by the CMT), as E[||G_i||²] < ∞ and E[||g_i||²] < ∞ by A2(ii),(iv). Hence

E[(1/n) Σ_{i=1}^n b_i b_i'] → Φ    (1.55)
As w_i is i.i.d., so is b_i, with variance converging to Φ by (1.55); then by the multivariate Lindeberg–Lévy Central Limit Theorem (note that technically b_i is a function of n, though only through Δ̄_n, where Δ̄_n = Δ̄ + o_p(1), hence we can appeal to this theorem w.p.1),

P_0'(√n (Ĝ(θ_0) − G) Δ̄_n − Ψ Ω*_+ √n ĝ(θ_0)) →_d N(0, Φ)    (1.56)

Hence (1.51) converges in distribution to χ²_{m̄}; since both terms on the right-hand side of (1.38) are asymptotically orthogonal, the sum of the two is asymptotically χ²_m. Q.E.D.

Proof of Theorem 4: Dividing equation (1.51) by n (and noting that P_0'G ≠ 0 since A3(i) is violated), it is straightforward to establish that

(P̂_0(θ_n)' ĝ(θ_n))' Λ̂_0(θ_n)^{-1} P̂_0(θ_n)' ĝ(θ_n) →_p Δ' G' P_0 Φ^{-1} P_0' G Δ    (1.57)

since by A2(i),(iv) P_0' Ĝ(θ_0) →_p P_0' G. As the first term on the right-hand side of (1.38) converges to zero in probability when divided by n, it is straightforward to establish that T̂_GAR(θ_n)/n →_p Δ' G' P_0 Φ^{-1} P_0' G Δ. Q.E.D.
Chapter 2

Overcoming The Many Weak Instrument Problem Using Normalized Principal Components

Abstract

Principal Component (PC) techniques are commonly used to improve the small sample properties of the Linear Instrumental Variables (IV) estimator. Carrasco (2012) argues that PC type methods provide a natural ranking of instruments with which to reduce the size of the instrument set. This chapter shows how reducing the size of the instrument set based on PC methods can lead to poor small sample properties of IV estimators. A new approach to ordering instruments, termed ‘Normalized Principal Components’ (NPC), is introduced to overcome this problem. A simulation study shows the favorable small sample properties of IV estimators using NPC methods to reduce the size of the instrument set relative to PC. Using NPC, evidence is provided that the IV setup in Angrist & Krueger (1992) may not suffer the weak instrument problem.

Keywords: Many Weak Instrument Bias, Instrument Selection, Principal Components.
2.1 Introduction

The many weak instrument bias for linear IV estimators[1] is now widely recognized and understood in the literature. The small sample (higher order) bias of IV estimators is a function of both the size and the strength of a set of instruments. In general this bias is increasing in the size and decreasing in the strength of a set of instruments[2] (Rothenberg (1984), Staiger & Stock (1997), Stock, Wright & Yogo (2002), Hahn & Hausman (2003), Hahn, Hausman & Kuersteiner (2004), Newey & Smith (2004), Chao & Swanson (2005), Hahn, Hausman & Newey (2008), Newey & Windmeijer (2009)).

A common instrument reduction technique utilized in IV settings is Principal Components. The Principal Components method applied to IV models is now well documented in both theoretical and applied research (Kloek & Mennes (1960), Amemiya (1966), Doran & Schmidt (2006), Winkelreid & Smith (2011), Carrasco (2012), Carrasco & Tchuente (2012)). Doran & Schmidt (2006) consider the small sample properties of dynamic panel GMM estimators. They provide a heuristic argument why dropping those Principal Components (PCs) that explain the least amount of variation within a set of moments could improve the small sample properties of GMM. When a subset of PCs are (almost) irrelevant, Doran & Schmidt (2006) argue the PC method may be able to reduce the dimension of the moments used for estimation with potentially little loss in efficiency. This logic underlies much of the literature utilizing Principal Components to reduce the size of the moments to improve small sample properties of GMM type estimators.

Carrasco (2012) and Carrasco & Tchuente (2012) derive higher order Mean Square Error (MSE) approximations similar to Donald & Newey (2001), though for a potentially infinite number of instruments. Various regularization methods are considered to invert the sample covariance matrix of the instruments in finite samples. Principal Components and related regularization techniques are considered in both papers.

[1] Throughout the chapter when referring to an IV estimator we refer in general to any estimator based on a set of instruments (e.g. Two Stage Least Squares (2SLS), Generalized Method of Moments (GMM), Generalized Empirical Likelihood (GEL)) unless specifically stated otherwise.
[2] For ease of exposition this chapter will illustrate ideas based on the one endogenous variable, many exogenous variables IV setup. The ideas generalize naturally to the case of many endogenous regressors.
When the size of the instrument set is less than the sample size, the MSE approximations in Carrasco (2012) and Carrasco & Tchuente (2012) collapse to a modified version of the MSE approximations in Donald & Newey (2001).

Irrespective of the size of the instrument set, the underlying premise of Carrasco (2012) and Carrasco & Tchuente (2012) is the same as much of the other literature on PC type methods for instrument reduction: namely, that transforming instruments into their PCs and ranking each PC by its variance provides a good ranking of these transformed instruments in terms of their correlation with the endogenous variable.

Though at first seemingly intuitive, the premise of PC methods when applied to reducing the size of the instrument set can be crucially flawed. PC methods generally reduce the dimension of the instrument set by keeping those PCs with the largest variance (i.e. those linear combinations of the instruments that explain most of the variation within the instrument set). This chapter demonstrates how those PCs that explain most of the variation within the instrument set need not explain any of the variation in the endogenous variable. In fact it is possible that all of the variation in the endogenous variable explained by the instrument set lies in those PCs with the smallest variance. However these are exactly the PCs that are dropped using PC methods of instrument reduction. Hence it is entirely plausible that selecting instruments based on PC methods may lead to poor small sample properties of IV estimators. This situation could arise even when there exist linear combinations of instruments with a strong correlation to the endogenous variable. As such, an adapted method of instrument ordering is derived to overcome this problem. This method of instrument dimension reduction is termed ‘Normalized Principal Components’ (NPC).

NPC transforms the instrument set in such a way that the transformed instruments may be ranked by the amount of variation in the endogenous variable that they explain. To do this NPC normalizes all PCs to have equal variance. NPC then estimates parameters from a least squares regression of the endogenous variable on the NPCs. The squares of these estimated parameters are shown (under regularity conditions) to form a consistent ranking of the variation each corresponding NPC explains of the total variation in the endogenous variable. This method provides a natural and clear way to order a set of instruments in terms of their strength.

Using this ranking, instruments may be selected in some efficient way to improve the small sample properties of the IV estimator. For example an ad-hoc rule could be adopted, e.g. selecting all NPCs that have t-values in the first stage regression above a certain threshold. NPCs also provide a
natural ordering of instruments with which to minimize the MSE approximations of Donald & Newey (2001). This is useful as Donald & Newey (2001) point out that the practical use of MSE approximations with large instrument sets is limited without some a priori ranking of instrument strength.

In order to implement NPC in practice, a precise estimate of both the variance matrix of the instruments and the first stage parameters from a regression of the endogenous variable on the NPCs is required. When the sample size is small relative to the number of instruments these estimates may be imprecise, and the NPC ranking of instruments may be poor in this case. However the PC method of Carrasco (2012) and similar approaches suffer this drawback also, in that PC methods rely upon a precise estimate of the covariance matrix of the instruments. PC methods also suffer the further drawback that ordering PCs by their variances (eigenvalues) may be a poor indicator of their correlation with the endogenous variable even asymptotically.

There exist many classic examples of many weak instrument problems where the sample size far outweighs the number of instruments. For example, the returns to education data of Angrist & Krueger (1992) use the Vietnam War Lottery dummies as an instrument for educational attainment; the sample size is over 25,000 with only 130 instrumental variables. Another example is the Angrist & Krueger (1991) wage-education data using Quarter of Birth as an instrument, with a sample size of over 300,000 and 180 instruments. With such data sets one may expect precise estimates of the covariance matrix of the instruments and the relevant first stage parameters.

It is widely regarded in the literature that the Vietnam War Lottery as an instrument for educational attainment is weak (Bound, Jaeger & Baker (1995), Angrist & Krueger (1995)). However, applying the NPC method to this set of instruments, it is demonstrated that there exist up to 14 statistically significant (p < 0.1) linear combinations from the total 130 instruments considered in Angrist & Krueger (1992). As such this IV problem may not suffer the weak instruments problem. The poor small sample properties in Angrist & Krueger (1992) may instead be due to the poor small sample properties of 2SLS with many instruments. Various estimators based on a reduced set of instruments, selected using various criteria based on the NPC ordering, estimate a return to education much lower than the corresponding OLS estimate in Angrist & Krueger (1992). This conforms with our a priori notion of the sign of the bias in wage-education regressions; unlike
2SLS based on all instruments, which estimates a return to education much larger than that of OLS.

A simulation experiment compares the estimation error of IV estimators based on an ordering of instruments using the NPC ranking relative to that of PC. Both the NPC and PC methods are utilised as a basis with which to minimize the small sample MSE approximations of Donald & Newey (2001). The simulation study demonstrates the favorable small sample properties of IV based on ranking PCs by the NPC method as opposed to that of PC. The PC approach to selecting instruments is shown in some cases to yield IV estimators with extremely poor small sample properties when the PC ranking of instruments is poor.

Section 2.2 recaps the literature on instrument selection. Section 2.3 details the potential problem of PC and shows how the NPC method of ranking instruments overcomes this. Section 2.4 presents a small simulation study to show the performance of NPC as an ordering of instruments to minimize the MSE of Donald & Newey (2001) relative to the PC methods considered in Carrasco (2012). Section 2.5 applies the NPC method of choosing instruments to Angrist & Krueger (1992). Section 2.6 provides concluding remarks. Proofs of Lemmas, along with details on how to practically implement the NPC method detailed in the chapter (along with R code), are collected into an Appendix.

2.2 Instrument Selection Methods

The poor small sample properties of IV estimators with many (weak) instruments are now well documented in the literature. In light of this problem, a thriving area of research considers methods to select instruments to reduce this many (weak) instrument bias (Hall, Rudebusch & Wilcox (1996), Shea (1997), Donald & Newey (2001), Hall & Peixe (2003), Donald, Imbens & Newey (2009), Kuersteiner & Okui (2010), Carrasco (2012)). The literature on methods to select instruments is now vast. This chapter focuses mainly on the methods derived in Donald & Newey (2001) and Carrasco (2012), providing a compact review of the instrument selection techniques from each paper. Donald & Newey (2001) derive an approximation to the small sample mean squared error of the linear homoscedastic IV estimator as a function of any given instrument set.
The MSE type approximations of Donald & Newey (2001) are sketched in Section 2.2.1. Essentially these expansions provide an approximation to the small sample MSE of various estimators based on a given set of instruments. As such they provide a natural criterion with which to select instruments to efficiently reduce the dimension of the instrument set.

2.2.1 MSE Approximations of Donald & Newey (2001)

Donald & Newey (2001) [DN] consider the following linear IV model:

y_i = x_i'β + ε_i    (2.1)
x_i = f(z_i) + η_i    (2.2)

where x_i is a p×1 vector of endogenous/exogenous variables and z_i is an m*×1 vector of variables such that E[η_i|z_i] = 0, E[ε_i²|x_i] = σ² and E[η_iε_i|z_i] = σ_ηε, where σ_ηε is a p×1 vector and σ² > 0. f(·) is a p×1 function of the excluded exogenous variables z_i. An m×1 vector of instruments Z_i = φ(z_i) can be formed, where φ(·) is some m×1 function of the exogenous variables z_i; common examples include polynomials of z_i. Then

x_i = Π_m Z_i + ε_m(Z_i) + η_i    (2.3)

where ε_m(Z_i) is a p×1 vector containing the approximation error for a given instrument set Z_i (i.e. ε_m(Z_i) = f(z_i) − Π_m Z_i). The assumption is that as m → ∞ some p×m linear combination Π_m of Z_i approximates f(z_i) with arbitrary precision (i.e. E[||ε_m(Z_i)||²] → 0 as m → ∞ at some rate). Define Z = (Z_1, .., Z_n)'.

For every instrument set of size m, the asymptotic variance is estimated by the usual estimate of the Semiparametric Lower Bound (SPLB) and the bias is estimated as some function of m and Z. DN derive higher order expansions for 2SLS, Limited Information Maximum Likelihood (LIML) and Bias Corrected 2SLS. For brevity only the MSE approximation for 2SLS is detailed. For each of the estimators, under certain restrictions on the growth rate of m relative to n, the MSE is of the form

n(β̂ − β_0)(β̂ − β_0)' = σ²H⁻¹ + S(m) + r(m)    (2.4)
where S(m) varies for the different estimators, r(m) is asymptotically negligible and σ²H⁻¹ is the asymptotic variance (SPLB), with H := E[f(z_i)f(z_i)']. For 2SLS, under the assumption that m²/n → 0 along with other regularity conditions, DN show that the above approximation holds with

S(m) = H⁻¹ ( σ_ηε σ_ηε' m²/n + σ² f'(I − P)f/n ) H⁻¹    (2.5)

where P := Z(Z'Z)⁻¹Z' is the projection matrix of the instruments Z and f := (f(z_1), .., f(z_n))'. In practice σ²H⁻¹ (the asymptotic variance of 2SLS) and S(m) (the higher order bias) are estimated using their sample counterparts; see Donald & Newey (2001) for a discussion. Appendix A in Section 2.9 provides the R code for estimating σ²H⁻¹ + S(m).[3]

The number of instrument sets over which to choose increases exponentially with m and becomes computationally challenging. Optimizing over such a large set of potential instrument combinations may also lead to unstable estimates and give poor second stage estimators. In light of this, DN argue an a priori notion of which instruments are strongest is required to reduce the dimension of the discrete optimisation problem. This knowledge is not often known or predicted by theory in general. The regularization approach of Carrasco (2012) and Carrasco & Tchuente (2012) provides one such ranking based on Principal Components and related techniques. They also generalize the expansions of DN (2001) for 2SLS and LIML to the case where m > n and possibly infinite.

[3] Specifically the code allows one to perform the NPC ranking detailed in Section 2.6 and estimates the MSE expansions of DN evaluated as a function of NPCs. It would be easy to modify the code and evaluate the MSE approximations with a different form of instruments.
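As a rough numerical illustration of how σ²H⁻¹ + S(m) in (2.4)–(2.5) can be put together from sample counterparts, the short R sketch below covers the single endogenous regressor case. It is only an illustrative sketch and not the Appendix A code: the input names y, x and Z are assumptions, the residuals come from a preliminary 2SLS fit, and f'(I − P)f/n is replaced by a crude plug-in x'(I − P)x/n rather than the bias-corrected estimate DN discuss.

# Illustrative sketch (not the Appendix A code): plug-in estimate of the DN (2001)
# 2SLS criterion sigma^2*H^{-1} + S(m) with one endogenous regressor.
# Assumed inputs: y and x are n-vectors, Z is an n x m matrix of instruments.
dn_2sls_criterion <- function(y, x, Z) {
  n <- length(y); m <- ncol(Z)
  P    <- Z %*% solve(crossprod(Z), t(Z))          # projection matrix Z(Z'Z)^{-1}Z'
  Px   <- as.vector(P %*% x)
  beta <- sum(Px * y) / sum(Px * x)                # 2SLS slope coefficient
  eps  <- y - x * beta                             # structural residuals
  eta  <- x - Px                                   # first stage residuals
  sig2   <- mean(eps^2)                            # estimate of sigma^2
  sig_ue <- mean(eta * eps)                        # estimate of sigma_{eta eps}
  H      <- sum(x * Px) / n                        # plug-in estimate of H = E[f(z_i)^2]
  f_IPf  <- max(sum(x * (x - Px)) / n, 0)          # crude plug-in for f'(I - P)f / n
  S_m    <- (sig_ue^2 * m^2 / n + sig2 * f_IPf) / H^2   # S(m) in (2.5), scalar case
  sig2 / H + S_m                                   # estimated sigma^2*H^{-1} + S(m)
}

In a selection exercise one would evaluate this criterion over candidate instrument subsets (e.g. the first k instruments under some a priori ordering) and keep the subset that minimizes it.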
2.2.2 The Regularization MSE Approach

Carrasco (2012) considers the same linear IV setup as in DN detailed in Section 2.2.1, but for the case where m may be greater than n and possibly infinite. When faced with an infinite number of instruments (or more generally when m > n) the problem is ill posed, since the sample covariance matrix of the instruments is singular. Carrasco (2012) uses various regularization methods to approximate the sample covariance matrix of the instruments and generalizes the MSE approximations in DN (2001) for 2SLS. Carrasco & Tchuente (2012) provide similar results for the LIML estimator.

Carrasco (2012) uses a regularized inverse of the instrument sample covariance matrix. The intuition is highlighted here for the Principal Components and Spectral Cut-Off regularizations. Define K_n as the (potentially infinite dimensional) sample covariance matrix of Z, λ_jn the j'th sample eigenvalue and φ_jn the corresponding sample eigenvector (<·,·> is the inner product with respect to the Euclidean norm), such that for any vector r (conformable with the dimension of K_n),

K_n⁻¹ r := Σ_{j=1}^∞ (1/λ_jn) <r, φ_jn> φ_jn    (2.6)

In finite samples a truncation of the (infinite dimensional) sample covariance matrix is required to form a problem which is tractable and which yields an asymptotically valid approximation. The different truncations used correspond to the differing methods of regularization. For example, the Spectral Cut-Off (SC) regularization approximates K_n using eigenvectors with eigenvalues greater than some threshold α (using the eigenvectors with the largest eigenvalues first, as these correspond to most of the variation within the instrument set):

(K_n^α)⁻¹ r := Σ_{λ²_jn ≥ α} (1/λ_jn) <r, φ_jn> φ_jn    (2.7)

K_n^α is the truncated approximation to K_n. α is the tuning parameter and, as discussed in Carrasco (2012), can be viewed as the counterpart to the tuning parameter in non-parametric estimation. See Carrasco (2012) for details on regularizing potentially infinite dimensional matrices and for other forms of regularization. The PC regularization is directly linked to the SC method and approximates K_n using the eigenvectors with index j ≤ 1/α (i.e. those with the largest eigenvalues). Then for α → 0 fast enough relative to n → ∞, Carrasco (2012) shows the linear IV-GMM estimator based on this regularization will reach the SPLB and be consistent and asymptotically normal under certain regularity conditions. Here the tuning parameter α plays the role of m, the number of instruments, in DN.
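To make the truncation in (2.7) concrete, the following R sketch applies the spectral cut-off regularized inverse of the sample covariance of a finite instrument matrix to a conformable vector r. The function name and the use of the ordinary m × m sample covariance are assumptions of this sketch; it is not Carrasco's (2012) implementation, which handles the potentially infinite dimensional case.

# Illustrative sketch: spectral cut-off regularized inverse (K_n^alpha)^{-1} r as in (2.7),
# with K_n taken to be the m x m sample covariance of the instruments Z (m <= n).
sc_reg_inverse <- function(Z, r, alpha) {
  Kn  <- cov(Z)                                  # sample covariance matrix of Z
  eig <- eigen(Kn, symmetric = TRUE)             # lambda_jn (decreasing) and phi_jn
  keep <- which(eig$values^2 >= alpha)           # spectral cut-off: lambda_jn^2 >= alpha
  out <- numeric(length(r))
  for (j in keep) {
    phi <- eig$vectors[, j]
    out <- out + (1 / eig$values[j]) * sum(r * phi) * phi   # (1/lambda) <r, phi> phi
  }
  out
}

The PC regularization described above would instead retain the first 1/α components, i.e. replace the keep rule with seq_len(min(floor(1/alpha), ncol(Z))).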
Carrasco (2012) derives the MSE as a function of α: for α²n → 0,

n(β̂ − β_0)(β̂ − β_0)' = σ²H⁻¹ + S(α) + r(α)    (2.8)

where r(α) is asymptotically negligible and

S(α) = H⁻¹ ( σ_ηε σ_ηε' (Σ_j q(α, λ_j²))²/n + σ² f'(I − P^α)f/n ) H⁻¹    (2.9)

where for Principal Components q(α, λ_j²) = I(j ≤ 1/α) and P^α is the projection on the space spanned by the first n eigenvectors of the truncated sample covariance matrix. Note the similarity with DN, where now the tuning parameter is α instead of m.

We consider a special case of Carrasco (2012) where m ≤ n. In this case the MSE approximation in Carrasco (2012) for PC collapses to that of DN (2001), with the MSE approximations in DN evaluated at the PCs as opposed to the original instruments Z. This is the setting considered in the simulation experiment in Section 2.4. Though the Carrasco (2012) MSE approximations are more general in allowing m > n, PCs are ordered by their eigenvalues, which is shown in Section 2.3 to potentially lead to an IV-GMM estimator with poor small sample properties.

Instrument Reduction Techniques

This section discusses the PC method of instrument reduction and highlights the potentially critical flaw in this technique for instrument selection. The NPC method is introduced and is demonstrated to overcome the flaw of PC methods. Conditions under which NPC works well asymptotically are also sketched.

2.3 Principal Components Ranking of Instruments

When faced with an m×1 set of instruments Z_i, it is plausible there may exist some linear combinations of these variables that explain a large portion of the variation within Z_i. Principal Components is a method that identifies which linear combinations of the variables Z_i explain most of the total variation in Z_i. Define Σ := Var(Z_i), the unknown population covariance matrix of the instruments. Define P_j as the j'th eigenvector corresponding to the eigenvalue λ_j (i.e. P_j'ΣP_j = λ_j) for j = {1, .., m}, P := (P_1, .., P_m) and Λ an m × m diagonal matrix
where [Λ]_jj = λ_j. Since Σ is symmetric (by construction) and positive definite, Σ may be eigen-decomposed as Σ = PΛP', where P'P = I_{m×m}. The j'th PC is defined as Z^pc_ij = P_j'Z_i. The PC with the largest variance is the one with the largest eigenvalue, since Var(Z^pc_ij) = P_j'Var(Z_i)P_j = P_j'ΣP_j = λ_j. Hence the variance of the PC Z^pc_ij (which is a linear combination of Z_i using as weights the eigenvector P_j) is λ_j. See Jolliffe (2002) for a detailed discussion of the method of Principal Components.

The PC method as applied to instrumental variables is to transform the instrument set using the eigenvectors P and rank this transformed set of instruments by their corresponding eigenvalues, namely to order the PCs by the size of their variances. This is then used as a basis for dimension reduction of the instrument set. Without loss of generality, the PCs are ordered such that λ_1 ≥ λ_2 ≥ ... ≥ λ_m.

To implement the PC method in practice requires a consistent estimate of Σ with which to estimate P (the matrix of PC weightings) and Λ (the diagonal matrix with the eigenvalues of Σ along the diagonal). Natural estimates can be formed taking the sample variance of Z_i as an estimate of Σ, namely Σ̂ := (1/n)Σ_{i=1}^n (Z_i − Z̄)(Z_i − Z̄)' where Z̄ := (1/n)Σ_{i=1}^n Z_i. Σ̂ can be eigen-decomposed as Σ̂ = P̂Λ̂P̂', where P̂'P̂ = I_{m×m} and Λ̂ is a diagonal matrix with the sample eigenvalues λ̂_j, where P̂_j'Σ̂P̂_j = λ̂_j (j = {1, .., m}), along the diagonal. When ||Σ̂ − Σ|| = O_p(n^{-1/2}) (sufficient conditions being that Z_i is i.i.d. with E[||g_i(β_0)||²] < ∞, i = {1, .., n}), it can be shown that ||P̂ − P|| = O_p(mn^{-1/2}) and ||Λ̂ − Λ|| = O_p(mn^{-1/2}) (e.g. Bosq (2000)). So long as m²/n → 0 (along with some regularity conditions), P̂_j and λ̂_j consistently estimate P_j and λ_j respectively for j = {1, .., m}.

A common sample PC method then estimates Z^pc_ij as Ẑ^pc_ij = P̂_j'Z_i, which has sample variance equal to λ̂_j, analogous to the population case above. The sample PCs are then ranked based on the size of the sample eigenvalues (i.e. their sample variances).
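The sample construction just described takes only a few lines in R; the object names below are illustrative and the instrument matrix Z is assumed to be n × m with m ≤ n.

# Illustrative sketch: sample Principal Components of the instruments, ranked by
# their sample variances (eigenvalues), as described above.
Zc         <- scale(Z, center = TRUE, scale = FALSE)   # Z_i - Zbar, row by row
Sigma_hat  <- crossprod(Zc) / nrow(Z)                   # sample covariance with the 1/n convention
eig        <- eigen(Sigma_hat, symmetric = TRUE)
P_hat      <- eig$vectors                               # columns are the eigenvectors P_hat_j
lambda_hat <- eig$values                                # lambda_hat_1 >= ... >= lambda_hat_m
Z_pc       <- Zc %*% P_hat                              # sample PCs: Z_pc[i, j] = P_hat_j' Z_i
# eigen() returns eigenvalues in decreasing order, so the PC method keeps the leading
# columns of Z_pc and drops those with the smallest lambda_hat.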
2.3.1 Problem With The PC Method of Instrument Reduction

This section demonstrates the potential flaw with using PC methods as a basis for instrument reduction. In order to illustrate this problem, the linear IV setup in Section 2.3 with one endogenous variable and no exogenous variables (i.e. p = 1) is used. The idea extends readily to more than one endogenous variable and to models with exogenous controls.[5]

x_i = π'Z_i + η_i    (2.10)

where π is an m × 1 vector of first stage coefficients. Take the population PCs Z^pc_i = P'Z_i (where Z^pc_ij = P_j'Z_i as defined above). Defining π_pc := P'π, where π_pcj is the j'th element of π_pc, then

x_i = π'Z_i + η_i = π'PP'Z_i + η_i = π_pc'Z^pc_i + η_i    (2.11)

since PP' = I and hence P' = P⁻¹. So π_pcj is the population coefficient from a regression of x_i on the j'th PC. We now derive the contribution of each PC to the total variation in x_i explained by all PCs (i.e. we decompose the total variation in x_i explained by the whole instrument set across the PCs):

Var(π'Z_i) = π'Σπ = π'PΛP'π = π_pc'Λπ_pc = Σ_{j=1}^m π²_pcj λ_j    (2.12)

The PC method then (asymptotically) ranks the transformed instruments Z^pc_ij by λ_j. However, the variation that the j'th PC Z^pc_ij contributes to the total variation of x_i explained by all m instruments (Var(π'Z_i)) is π²_pcj λ_j. If π_pcj = 0 then the j'th PC is irrelevant, irrespective of the size of λ_j. In fact, basing dimension reduction on a ranking by the eigenvalues could give a reverse ranking: take for example the simple case where π_pcj = λ_j^{−δ} for all j = {1, .., m} with δ > 1/2, so that π²_pcj λ_j = λ_j^{1−2δ} is largest for the PCs with the smallest eigenvalues. A simple example where the PC ranking provides a correct ranking of the strength of the PCs is where π_pcj = π_pci for all i, j ∈ {1, .., m}. It could even be the case that all the variation in x_i explained by the PCs lies within those components with the smallest eigenvalues: take the extreme example where π_pcj = 0 for all j ∈ {1, .., (m − 1)} and π_pcm ≠ 0. However, in general the PCs with the smallest eigenvalues are exactly those dropped using the PC method of dimension reduction.

[5] ε_m(Z_i) from (2.3) is omitted since under assumption it is asymptotically negligible for m → ∞.
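The reverse-ranking possibility just discussed is easy to see numerically. The R sketch below uses made-up values of m, δ and the eigenvalues (all hypothetical choices for illustration) and compares the PC ordering by λ_j with each PC's actual contribution π²_pcj λ_j to Var(π'Z_i) from (2.12).

# Illustrative sketch: when pi_pcj = lambda_j^{-delta} with delta > 1/2, ranking PCs by
# their eigenvalues exactly reverses the ranking by contribution pi_pcj^2 * lambda_j.
set.seed(1)
m      <- 10                                          # hypothetical number of instruments
delta  <- 1                                           # any delta > 1/2 gives the reversal
lambda <- sort(runif(m, 0.1, 5), decreasing = TRUE)   # hypothetical population eigenvalues
pi_pc  <- lambda^(-delta)                             # first stage coefficients on the PCs
contribution <- pi_pc^2 * lambda                      # each PC's share of Var(pi'Z_i), eq (2.12)
order(lambda, decreasing = TRUE)                      # PC ranking: 1, 2, ..., m
order(contribution, decreasing = TRUE)                # true strength ranking: m, m-1, ..., 1

Normalizing the PCs to have unit variance, as in the next section, removes this dependence on λ_j: the strength of each transformed instrument can then be read directly off its squared first stage coefficient.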
2.4 Normalized Principal Components

A more intuitive way to transform the instrument set, with which to form an ordering of the new instruments, is to normalize all the PCs to have equal variance. Define Z^npc_i = Λ^{−1/2}Z^pc_i. Then Var(Z^npc_i) = Λ^{−1/2}Var(Z^pc_i)Λ^{−1/2} = Λ^{−1/2}ΛΛ^{−1/2} = I_{m×m}, and Z^npc_i are the Normalized Principal Components (NPCs). Define π_npc := Λ^{1/2}P'π; then

x_i = π'Z_i + η_i = π_npc'Z^npc_i + η_i    (2.13)

since π_npc'Z^npc_i = π'PΛ^{1/2}Λ^{−1/2}P'Z_i = π'Z_i. Letting π_npcj denote the j'th element of π_npc, the variation in x_i explained by the m NPCs can be expressed as

Var(π'Z_i) = Var(π_npc'Z^npc_i) = π_npc'Var(Z^npc_i)π_npc = π_npc'π_npc = Σ_{j=1}^m π²_npcj    (2.14)

Hence π²_npcj is the contribution of the j'th NPC to the total variation in x_i explained by all NPCs. The NPCs may then be ranked in terms of their relevance by the absolute size of their parameters in the first stage regression. A natural measure of the strength of NPC j is

S_j := π²_npcj / Σ_{l=1}^m π²_npcl    (2.15)

the proportion of the total variation explained by NPC j. A natural way of ordering the NPCs is then by S_j: arrange the NPCs with S_j from largest to smallest, hence S_1 ≥ S_2 ≥ ... ≥ S_m. The share of the total variation in x_i explained by the whole set of NPCs (and hence by the original instrument set) that is captured by the first k NPCs is

C_{≤k} := Σ_{l=1}^k π²_npcl / Σ_{l=1}^m π²_npcl = Σ_{j=1}^k S_j    (2.16)

C_{≤k} is the proportion of the total variation of the endogenous variable explained by the whole instrument set that is captured by the first k NPCs. This is an extremely useful tool for visualizing how the informative content in a set of instruments is spread