This document summarizes a talk on inference on treatment effects after model selection. It discusses the challenges of inferring treatment effects after refitting a model selected by a procedure such as the lasso: the refitted estimator can be biased through both overfitting and underfitting of the selected model. The talk proposes repeated data splitting to remove the overfitting bias. In each split, one part of the data is used for model selection and the other part for estimating the treatment effect, so the estimation step is free of overfitting bias. This approach reduces bias relative to simply refitting the selected model on the full data.
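As a rough illustration of the splitting idea (not the speaker's exact procedure), the sketch below selects controls with a cross-validated lasso on one half of a simulated dataset, then runs plain OLS for the treatment coefficient on the other half. The data-generating process and all constants are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.normal(size=(n, p))                       # candidate controls
d = X[:, 0] + rng.normal(size=n)                  # treatment, confounded by X[:, 0]
y = 1.5 * d + 2.0 * X[:, 0] + rng.normal(size=n)  # true effect = 1.5

half = np.arange(n) < n // 2                      # boolean mask: selection half
# Stage 1: lasso on the selection half picks which controls matter.
lasso = LassoCV(cv=5).fit(np.column_stack([d, X])[half], y[half])
keep = np.flatnonzero(lasso.coef_[1:] != 0)       # indices of selected controls

# Stage 2: OLS on the held-out half with treatment + selected controls.
# Selection noise is independent of this half, so the usual standard
# error is not contaminated by the selection step.
n2 = int((~half).sum())
Z = np.column_stack([np.ones(n2), d[~half], X[~half][:, keep]])
beta, *_ = np.linalg.lstsq(Z, y[~half], rcond=None)
resid = y[~half] - Z @ beta
sigma2 = resid @ resid / (n2 - Z.shape[1])
se = float(np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]))
print(f"treatment effect ~ {beta[1]:.2f} +/- {1.96 * se:.2f}")
```

The talk's repeated-splitting proposal averages over many such splits; this sketch shows a single one.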
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model (weekendsunny)
This document summarizes the author's approach to predicting short-term stock price movements in the 2010 INFORMS Data Mining Contest. The author began with support vector machines and logistic regression, then tried LASSO (logistic regression with variable selection) and other methods, eventually settling on a two-stage variable-selection method that uses LASSO on lagged data to select variables for a generalized linear model, which achieved 3rd place. The document outlines the basic analysis, the variable-selection methods explored (both traditional approaches and the L1-penalized LASSO), and the results of testing the use of future information against the evaluation criteria.
This document discusses approximate Bayesian computation (ABC) methods for performing Bayesian inference when the likelihood function is intractable. ABC methods approximate the posterior distribution by simulating data under different parameter values and selecting simulations that match the observed data based on summary statistics. The document outlines how ABC originated in population genetics to model complex demographic scenarios and mutation processes. It then describes the basic ABC rejection sampling algorithm and how it provides an approximation of the posterior distribution by sampling from regions of high density defined by the summary statistics.
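A minimal rejection-ABC sketch for a toy normal-mean model may help fix ideas; the prior, summary statistic, and tolerance below are illustrative choices, not ones taken from the document.

```python
import numpy as np

rng = np.random.default_rng(1)
y_obs = rng.normal(loc=2.0, scale=1.0, size=100)      # "observed" data
s_obs = y_obs.mean()                                  # summary statistic

# Plain ABC rejection: draw theta from the prior, simulate a dataset,
# keep theta when the simulated summary lands within eps of the
# observed summary.
n_draws, eps = 50_000, 0.05
theta = rng.normal(0.0, 10.0, size=n_draws)           # prior: N(0, 10^2)
sims = rng.normal(theta[:, None], 1.0, size=(n_draws, 100))
accepted = theta[np.abs(sims.mean(axis=1) - s_obs) < eps]
print(f"{accepted.size} accepted; posterior mean ~ {accepted.mean():.2f}")
```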
Large sample property of the Bayes factor in a spline semiparametric regressi... (Alexander Decker)
This document summarizes a research paper about investigating the large sample property of the Bayes factor for testing the polynomial component of a spline semiparametric regression model against a fully spline alternative model. It considers a semiparametric regression model where the mean function has two parts - a parametric linear component and a nonparametric penalized spline component. By representing the model as a mixed model, it obtains the closed form of the Bayes factor and proves that the Bayes factor is consistent under certain conditions on the prior and design matrix. It establishes that the Bayes factor converges to infinity under the pure polynomial model and converges to zero almost surely under the spline semiparametric alternative model.
This document summarizes several statistical methods for handling non-ignorable nonresponse in data, including maximum likelihood estimation, partial likelihood approaches, generalized method of moments, and exponential tilting methods. It discusses full likelihood-based maximum likelihood estimation using the observed data likelihood and EM algorithm. Partial likelihood approaches like conditional likelihood and pseudo likelihood are presented as alternatives that use a subset of the observed data.
This document discusses Bayesian inference on mixture models. It covers several key topics:
1. Density approximation and consistency results for mixtures as a way to approximate unknown distributions.
2. The "scarcity phenomenon" where the posterior probabilities of most component allocations in mixture models are zero, concentrating on just a few high probability allocations.
3. Challenges with Bayesian inference for mixtures, including identifiability issues, label switching, and complex combinatorial calculations required to integrate over all possible component allocations.
This document discusses predictive mean matching (PMM) imputation in survey sampling. It begins with an outline and overview of the basic setup, assumptions, and PMM imputation method. It then presents three main theorems: 1) the asymptotic normality of the PMM estimator when the regression parameter β* is known, 2) the asymptotic normality when β* is estimated, and 3) the asymptotic properties of nearest neighbor imputation. The document also discusses variance estimation for the PMM estimator using replication methods like the bootstrap or jackknife. In summary, it provides a theoretical analysis of the asymptotic properties of PMM imputation and approaches for estimating the variance.
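For concreteness, here is a bare-bones sketch of PMM with a linear working regression and 1-nearest-neighbor donor matching; the data-generating process and response mechanism are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
respond = rng.random(n) < 0.7          # ~70% respondents (missingness ignorable here)

# Fit the working regression on respondents to get predicted means.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[respond], y[respond], rcond=None)
mu = X @ beta                          # predicted mean for every unit

# PMM: each nonrespondent borrows the *observed* y of the respondent
# whose predicted mean is nearest (1-NN matching on mu).
donors = np.flatnonzero(respond)
y_imp = y.copy()
for i in np.flatnonzero(~respond):
    j = donors[np.argmin(np.abs(mu[donors] - mu[i]))]
    y_imp[i] = y[j]

print(f"imputed mean ~ {y_imp.mean():.2f} vs complete-case {y[respond].mean():.2f}")
```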
The Bayesian paradigm provides a coherent approach for quantifying uncertainty given available data and prior information. Aspects of uncertainty that arise in practice include uncertainty regarding parameters within a model, the choice of model, and propagation of uncertainty in parameters and models for predictions. In this talk I will present Bayesian approaches for addressing model uncertainty given a collection of competing models including model averaging and ensemble methods that potentially use all available models and will highlight computational challenges that arise in implementation of the paradigm.
Some sampling techniques for big data analysis (Jae-kwang Kim)
This document describes different sampling techniques for big data analysis, including reservoir sampling and its variants. It provides an example to illustrate simple random sampling and calculates the expected value and variance of sampling errors. It then discusses probability sampling and its advantages over non-probability sampling. The document also introduces survey sampling and challenges in the era of big data, as well as how sampling techniques can still be useful for handling big data. It outlines reservoir sampling and two methods to improve it: balanced reservoir sampling and stratified reservoir sampling. A simulation study is described to compare the performance of these reservoir sampling methods.
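The variants described above build on the classic one-pass reservoir scheme (Algorithm R), sketched below in plain Python; the stream and sample size are arbitrary.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: a uniform without-replacement sample of size k
    from a stream of unknown length, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream):      # t is 0-based
        if t < k:
            reservoir.append(item)         # fill the reservoir first
        else:
            j = rng.randint(0, t)          # uniform on {0, ..., t}
            if j < k:                      # replace with probability k/(t+1)
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```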
Collaborative filtering using orthogonal nonnegative matrix (AllenWu)
This document summarizes a research paper that proposes using orthogonal nonnegative matrix tri-factorization (ONMTF) to fuse model-based and memory-based collaborative filtering approaches. ONMTF is used to co-cluster users and items to obtain centroids that are then used to select similar users and items for predicting unknown ratings. Experimental results on movie rating datasets show the ONMTF approach improves prediction accuracy over other collaborative filtering methods.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
This document discusses generative and discriminative classifiers. Generative classifiers model the joint distribution of data and labels, while discriminative classifiers directly model the conditional probability of labels given data. Naive Bayes is an example of a generative classifier, while logistic regression is a discriminative classifier that directly models the probability of a label given input features. The document provides mathematical details on naive Bayes, logistic regression, and how logistic regression can be trained to maximize conditional likelihood through gradient descent.
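A compact numpy sketch of the discriminative training described above: logistic regression fit by gradient ascent on the conditional log-likelihood, with a simulated design (all constants are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + features
w_true = np.array([-0.5, 1.0, -2.0, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

# Maximize sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)] by gradient
# ascent; the gradient of the mean log-likelihood is X^T (y - p) / n.
w = np.zeros(X.shape[1])
lr = 0.5
for _ in range(2000):
    pr = 1 / (1 + np.exp(-X @ w))
    w += lr * X.T @ (y - pr) / n
print(np.round(w, 2))   # should approach w_true
```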
This chapter discusses continuous random variables and their probability density functions. It introduces the normal and exponential distributions and how to calculate probabilities and descriptive statistics for continuous random variables. It also shows how to approximate the binomial distribution using the normal distribution. The chapter objectives are to introduce continuous random variables, discuss the normal distribution and standard normal table, and demonstrate the normal approximation to the binomial distribution.
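As a worked instance of the normal approximation, the snippet below compares the exact Binomial(100, 0.5) probability with the continuity-corrected normal value; the numbers are illustrative, not taken from the chapter.

```python
from scipy.stats import binom, norm

n, p = 100, 0.5
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5      # mean 50, sd 5

exact = binom.cdf(55, n, p)                      # P(X <= 55) exactly
approx = norm.cdf((55 + 0.5 - mu) / sigma)       # continuity correction
print(f"exact {exact:.4f} vs normal approximation {approx:.4f}")
```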
This document summarizes approximate Bayesian computation (ABC) methods. It begins with an overview of ABC, which provides a likelihood-free rejection technique for Bayesian inference when the likelihood function is intractable. The ABC algorithm works by simulating parameters and data until the simulated and observed data are close according to some distance measure and tolerance level. The document then discusses the asymptotic properties of ABC, including consistency of ABC posteriors and rates of convergence under certain assumptions. It also notes relationships between ABC and k-nearest neighbor methods. Examples applying ABC to autoregressive time series models are provided.
This document proposes an approximate Bayesian inference method for estimating propensity scores under nonresponse. It involves treating the estimating equations as random variables and assigning a prior distribution to the transformed parameters. Samples are drawn from the posterior distribution of the parameters given the observed data to make inferences. The method is shown to be asymptotically consistent and confidence regions can be constructed from the posterior samples. Extensions are discussed to incorporate auxiliary variables and perform Bayesian model selection by assigning a spike-and-slab prior over the model parameters.
This document summarizes key concepts from Chapter 2 of a book on statistical methods for handling incomplete data. It introduces the likelihood-based approach and defines key terms like the likelihood function, maximum likelihood estimator, Fisher information, and missing at random. The chapter also provides examples of observed likelihood functions for censored regression and survival analysis models with missing data.
This document discusses approximate Bayesian computation (ABC) for model choice between multiple models. It introduces the ABC algorithm for model choice, which approximates the posterior probabilities of models given the data by simulating parameters from the prior and accepting simulations based on the distance between simulated and observed sufficient statistics. Issues with choosing sufficient statistics that apply to all models are discussed. The document also examines the limiting behavior of the ABC approximation to the Bayes factor as the tolerance approaches 0 and infinity. It notes that discrepancies can arise if sufficient statistics are not cross-model sufficient. An example comparing Poisson and geometric models demonstrates this.
Interpretable Sparse Sliced Inverse Regression for digitized functional data (tuxette)
The document discusses interpretable sparse sliced inverse regression (IS-SIR) for functional data regression. It begins with background on using metamodels as proxies for computationally expensive agronomic models to understand relationships between climate inputs and plant outputs. SIR is presented as a semi-parametric regression technique that identifies relevant subspaces to predict outputs from functional inputs. The proposal involves combining SIR with automatic interval selection to point out interpretable predictor intervals. Simulations are discussed to evaluate the proposed method.
The document discusses approximate Bayesian computation (ABC), a simulation-based method for conducting Bayesian inference when the likelihood function is intractable or impossible to compute directly. ABC works by simulating data under different parameter values, and accepting simulations that are close to the observed data according to some distance measure. The document covers the basic ABC algorithm, convergence properties as the tolerance approaches zero, examples of ABC for probit models and MA time series models, and advances such as modifying the proposal distribution to increase efficiency.
The document discusses estimation of multi-Granger network causal models from time series data. It proposes a joint modeling approach to estimate vector autoregressive (VAR) models for multiple time series datasets simultaneously. The key steps are:
1. Estimate the inverse covariance matrices for each dataset using a factor model approach.
2. Use the estimated inverse covariance matrices in a generalized fused lasso optimization to jointly estimate the VAR coefficient matrices for each dataset.
Simulation results show the joint modeling approach improves estimation of the VAR coefficients and reduces forecasting error compared to estimating the models separately, especially when the number of time points is small. The factor modeling approach also provides a better estimate of the inverse covariance than using the empirical estimate.
ABC with data cloning for MLE in state space models (Umberto Picchini)
An application of the "data cloning" method for parameter estimation via MLE aided by Approximate Bayesian Computation. The relevant paper is http://arxiv.org/abs/1505.06318
Approximate Bayesian model choice via random forests (Christian Robert)
The document describes approximate Bayesian computation (ABC) methods for model choice when likelihoods are intractable. ABC generates parameter-dataset pairs from the prior and retains those where the simulated and observed datasets are similar according to a distance measure on summary statistics. For model choice, ABC approximates posterior model probabilities by the proportion of simulations from each model that are retained. Machine learning techniques can also be used to infer the most likely model directly from the simulated summary statistics.
This document discusses approximate Bayesian computation (ABC) techniques for performing Bayesian inference when the likelihood function is not available in closed form. It covers the basic ABC algorithm and discusses challenges with high-dimensional data. It also summarizes recent advances in ABC that incorporate nonparametric regression, reproducing kernel Hilbert spaces, and neural networks to help address these challenges.
Erica Rutter presents non-parametric techniques for estimating tumor heterogeneity from data. She describes using a Prohorov metric framework to determine the approximate distributions of diffusion (D) and growth (ρ) parameters from data, without assumptions about their distributions. She creates synthetic data from a known ρ distribution and solves the inverse problem to estimate ρ, comparing solutions using delta functions and spline functions with varying numbers of nodes. The Akaike Information Criterion is used to select the optimal number of nodes. Representative results show the estimated ρ distribution matching the true distribution well.
This document provides an overview of Approximate Bayesian Computation (ABC) methods for Bayesian model choice. ABC methods allow Bayesian inference when the likelihood function is intractable or unavailable. The ABC algorithm works by simulating parameters from the prior and accepting simulations where the simulated and observed data are close according to some distance measure and tolerance level. ABC outputs an approximation of the posterior distribution. An example application is presented for choosing a probit model for diabetes risk using data on Pima Indian women.
This document provides an overview of machine learning approaches for sequential data. It discusses Hidden Markov Models (HMMs), which model sequential data as a Markov process but cannot capture long-range dependencies. Window-based approaches consider a window of features but cannot model label dependencies. Maximum entropy models predict labels discriminatively but cannot model sequences. Conditional random fields (CRFs) are discriminative models that can jointly predict whole sequences by modeling dependencies between labels and features. CRFs overcome limitations of previous approaches by capturing long-range patterns in sequential data.
The document discusses using random forests for approximate Bayesian computation (ABC) model choice. It proposes:
1. Using random forests to infer a model from summary statistics, as random forests can handle a large number of statistics and find efficient combinations.
2. Replacing estimates of posterior model probabilities, which are poorly approximated, with posterior predictive expected losses to evaluate models.
3. An example comparing MA(1) and MA(2) time series models using two autocorrelations as summaries, noting that the models are embedded and that random forests perform similarly to other methods on small problems.
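A toy version of this pipeline, with all simulation settings invented for illustration: simulate MA(1) and MA(2) series, summarize each by its first two autocorrelations, and let a random forest pick the model for an "observed" series.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)

def ma_series(thetas, T=200):
    """Simulate x_t = e_t + theta_1 e_{t-1} + ... for an MA(q) model."""
    q = len(thetas)
    e = rng.normal(size=T + q)
    x = e[q:].copy()
    for k, th in enumerate(thetas, start=1):
        x += th * e[q - k:q - k + T]
    return x

def summaries(x):
    """Lag-1 and lag-2 autocorrelations."""
    xc = x - x.mean()
    denom = xc @ xc
    return [xc[:-1] @ xc[1:] / denom, xc[:-2] @ xc[2:] / denom]

# Reference table: simulate from each model with parameters drawn from a
# uniform (illustrative) prior; labels are the model indices.
S, labels = [], []
for _ in range(2000):
    S.append(summaries(ma_series([rng.uniform(-1, 1)])))
    labels.append(1)
    S.append(summaries(ma_series([rng.uniform(-1, 1), rng.uniform(-1, 1)])))
    labels.append(2)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(S, labels)
x_obs = ma_series([0.6, 0.3])                 # truth: MA(2)
print("chosen model:", clf.predict([summaries(x_obs)])[0])
```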
Regression is a method used for prediction problems involving continuous or ordered target variables. It models the relationship between predictor variables and a dependent variable. Linear regression finds the best fitting straight line to model this relationship, while nonlinear regression can model more complex relationships. Regularization techniques like ridge and lasso regression can help reduce overfitting. Regression trees and other models extend regression to handle categorical predictors. Evaluation metrics measure the accuracy of numeric predictions versus actual values.
Factors are categorical variables; the distinct values such a variable can take are called levels. In this talk, we consider the variable selection problem where the set of potential predictors contains both factors and numerical variables. Formally, this problem is a particular case of the standard variable selection problem, where factors are coded using dummy variables. As such, the Bayesian solution would seem straightforward and, possibly because of this, the problem has not received much attention in the literature despite its importance. Nevertheless, we show that this perception is illusory and that in fact several inputs, like the assignment of prior probabilities over the model space or the parameterization adopted for factors, may have a large (and difficult to anticipate) impact on the results. We provide a solution to these issues that extends the proposals in the standard variable selection problem and does not depend on how the factors are coded using dummy variables. Our approach is illustrated with a real example concerning a childhood obesity study in Spain.
Authors: Gonzalo Garcia-Donato and Rui Paulo
Bayesian Variable Selection in Linear Regression and A Comparison (Atilla YARDIMCI)
In this study, Bayesian approaches such as Zellner's, Occam's Window, and Gibbs sampling are compared in terms of selecting the correct subset of variables in a linear regression model. The aim of this comparison is to analyze Bayesian variable selection and the behavior of classical criteria under different values of β and σ and different prior expectation levels.
Regression analysis models the relationship between variables, where the dependent variable is modeled as a function of one or more independent variables. Linear regression models take forms such as straight-line, polynomial, Fourier, and interaction models. Multiple linear regression is useful for understanding variable effects, predicting values, and finding relationships between multiple independent and dependent variables. Methods like robust, stepwise, ridge, and partial least squares regression address issues like outliers, multicollinearity, and correlated predictors. Response surface and generalized linear models extend linear regression to nonlinear relationships. Multivariate regression models multiple dependent variables.
Regression analysis models the relationship between variables, including dependent and independent variables. Linear regression models take forms like straight lines, polynomials, trigonometric, and interaction terms. Multiple linear regression is useful for understanding variable effects, predicting values, and dealing with multicollinearity using methods like ridge regression, partial least squares, and stepwise regression. Nonlinear and generalized linear models also describe nonlinear relationships. Multivariate regression involves multiple response variables.
This document discusses multiple linear regression. It begins by explaining linear regression and its applications. It then discusses multiple linear regression, where there is more than one independent variable. As an example, it describes using multiple linear regression to estimate company profits based on various independent variables. The document provides resources for learning more about linear regression in Python.
We approach the screening problem - i.e. detecting which inputs of a computer model significantly impact the output - from a formal Bayesian model selection point of view. That is, we place a Gaussian process prior on the computer model and consider the $2^p$ models that result from assuming that each of the subsets of the $p$ inputs affects the response. The goal is to obtain the posterior probabilities of each of these models. In this talk, we focus on the specification of objective priors on the model-specific parameters and on convenient ways to compute the associated marginal likelihoods. These two problems, which are normally seen as unrelated, have challenging connections, since the priors proposed in the literature are specifically designed to have posterior modes on the boundary of the parameter space, hence precluding the application of approximate integration techniques based on, e.g., Laplace approximations. We explore several ways of circumventing this difficulty, comparing different methodologies with synthetic examples taken from the literature.
Authors: Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha) and Rui Paulo (Universidade de Lisboa)
We describe different approaches for specifying models and prior distributions for estimating heterogeneous treatment effects using Bayesian nonparametric models. We make an affirmative case for direct, informative (or partially informative) prior distributions on heterogeneous treatment effects, especially when treatment effect size and treatment effect variation are small relative to other sources of variability. We also consider how to provide scientifically meaningful summaries of complicated, high-dimensional posterior distributions over heterogeneous treatment effects with appropriate measures of uncertainty.
Maximum likelihood estimation of regularisation parameters in inverse problem... (Valentin De Bortoli)
This document discusses an empirical Bayesian approach for estimating regularization parameters in inverse problems using maximum likelihood estimation. It proposes the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm, which uses Markov chain sampling to approximate gradients in a stochastic projected gradient descent scheme for optimizing the regularization parameter. The algorithm is shown to converge to the maximum likelihood estimate under certain conditions on the log-likelihood and prior distributions.
This document summarizes a journal article that proposes an alternative approach to variable selection called the KL adaptive lasso. The KL adaptive lasso replaces the squared error loss used in traditional adaptive lasso with Kullback-Leibler divergence loss. The paper shows that the KL adaptive lasso enjoys oracle properties, meaning it performs as well as if the true underlying model was given. Specifically, it consistently selects the true variables and estimates their coefficients at optimal rates. The KL adaptive lasso can also be solved using efficient algorithms like LARS. The approach is extended to generalized linear models, and theoretical properties are discussed.
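The KL-loss variant has no off-the-shelf solver, but the standard adaptive lasso that the paper modifies can be sketched with the usual rescaling trick (pilot weights from an OLS fit; data simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(5)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0] + [0.0] * (p - 2))   # sparse truth
y = X @ beta + rng.normal(size=n)

# Adaptive lasso via feature rescaling: with weights w_j = 1/|beta_init_j|,
# minimizing ||y - Xb||^2 + lam * sum_j w_j |b_j| is equivalent to a plain
# lasso on the columns X_j * |beta_init_j|, followed by unscaling.
init = LinearRegression().fit(X, y).coef_        # pilot estimate
scale = np.abs(init)
fit = LassoCV(cv=5).fit(X * scale, y)
beta_hat = fit.coef_ * scale
print(np.round(beta_hat, 2))                     # zeros on the noise variables
```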
This document presents a presentation on regression analysis submitted to Dr. Adeel. It includes:
- An introduction to regression analysis and its uses in measuring relationships between variables and making predictions.
- Methods for studying regression including graphically, algebraically using least squares, and deviations from means.
- An example calculating regression equations using data on students' grades and scores through least squares and deviations from means.
- Conclusion that the regression equations match those obtained through other common methods.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
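A common quantile-based calibration can be sketched on the same kind of toy normal-mean problem (all settings illustrative): simulate a reference table first, then set the tolerance to a small quantile of the realized distances.

```python
import numpy as np

rng = np.random.default_rng(6)
y_obs = rng.normal(2.0, 1.0, size=100)
s_obs = y_obs.mean()

# Build a reference table of (theta, summary) pairs, then calibrate eps
# as a small quantile of the distances instead of fixing it a priori.
theta = rng.normal(0.0, 10.0, size=50_000)            # prior draws
s_sim = rng.normal(theta[:, None], 1.0, size=(50_000, 100)).mean(axis=1)
dist = np.abs(s_sim - s_obs)
eps = np.quantile(dist, 0.001)                        # keep the closest 0.1%
post = theta[dist <= eps]
print(f"eps ~ {eps:.3f}; posterior mean ~ {post.mean():.2f}")
```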
The document describes a method for identifying small inclusions (or point sources) embedded in a medium from multistatic boundary measurements. It proposes using a sampling method combined with the reciprocity gap concept. This allows identifying the locations and properties of small inclusions without needing to compute the full background Green's tensor. Numerical validation is provided by testing the method on multiple small inclusion configurations.
Asymptotic properties of Bayes factor in one-way repeated measurements model (Alexander Decker)
1) The document discusses asymptotic properties of Bayes factors for testing linear models in one-way repeated measurements designs.
2) It considers a linear mixed model with one within-subject factor and one between-subject factor, including random unit effects and error.
3) The authors investigate the consistency of the Bayes factor for testing a fixed effects model against this mixed model alternative. Under certain conditions on priors and design matrices, they derive the analytic form of the Bayes factor and show it is consistent.
A big task often faced by practitioners is deciding on the appropriate model to adopt when fitting count datasets. This paper investigates a suitable model for fitting highly skewed count datasets. Among other models, the COM-Poisson regression model is proposed for fitting count data because of its varying normalizing constant. Several statistical models are investigated alongside the proposed model, including the Poisson, Negative Binomial, Zero-Inflated, Zero-Inflated Poisson, and Quasi-Poisson models. A real-life dataset on visits to the doctor within a given period is also used to test the behavior of the underlying models. Based on the findings, it is recommended that the COM-Poisson regression model be adopted for fitting highly skewed count datasets irrespective of the type of dispersion.
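COM-Poisson fitting needs a dedicated package, but the kind of baseline comparison the paper runs can be sketched with statsmodels on simulated overdispersed counts (the data-generating choices below are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=n)
# Overdispersed counts whose mean depends on x: negative binomial with
# mean mu (numpy's (n, p) parameterization gives mean n*(1-p)/p = mu).
mu = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(2, 2 / (2 + mu))

X = sm.add_constant(x)
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print(f"Poisson AIC {pois.aic:.1f} vs NegBin AIC {nb.aic:.1f}")  # NB should win
```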
Control Synthesis by Sum of Squares Optimization (Behzad Samadi)
The document outlines a presentation on control synthesis using sum of squares optimization. It begins with an introduction to convex optimization and sum of squares analysis. It then discusses applications of these techniques to control systems and stability analysis. The document provides examples of using sum of squares to solve global optimization problems and verify stability of nonlinear systems.
The International Journal of Engineering and Science (The IJES) (theijes)
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Similar to MUMS: Bayesian, Fiducial, and Frequentist Conference - Inference on Treatment Effects after Model Selection, Jingshen Wang, April 30, 2019
Recently, the machine learning community has expressed strong interest in applying latent variable modeling strategies to causal inference problems with unobserved confounding. Here, I discuss one of the big debates that occurred over the past year, and how we can move forward. I will focus specifically on the failure of point identification in this setting, and discuss how this can be used to design flexible sensitivity analyses that cleanly separate identified and unidentified components of the causal model.
I will discuss paradigmatic statistical models of inference and learning from high dimensional data, such as sparse PCA and the perceptron neural network, in the sub-linear sparsity regime. In this limit the underlying hidden signal, i.e., the low-rank matrix in PCA or the neural network weights, has a number of non-zero components that scales sub-linearly with the total dimension of the vector. I will provide explicit low-dimensional variational formulas for the asymptotic mutual information between the signal and the data in suitable sparse limits. In the setting of support recovery these formulas imply sharp 0-1 phase transitions for the asymptotic minimum mean-square-error (or generalization error in the neural network setting). A similar phase transition was analyzed recently in the context of sparse high-dimensional linear regression by Reeves et al.
Many different measurement techniques are used to record neural activity in the brains of different organisms, including fMRI, EEG, MEG, lightsheet microscopy and direct recordings with electrodes. Each of these measurement modes have their advantages and disadvantages concerning the resolution of the data in space and time, the directness of measurement of the neural activity and which organisms they can be applied to. For some of these modes and for some organisms, significant amounts of data are now available in large standardized open-source datasets. I will report on our efforts to apply causal discovery algorithms to, among others, fMRI data from the Human Connectome Project, and to lightsheet microscopy data from zebrafish larvae. In particular, I will focus on the challenges we have faced both in terms of the nature of the data and the computational features of the discovery algorithms, as well as the modeling of experimental interventions.
1) The document presents a statistical modeling approach called targeted smooth Bayesian causal forests (tsbcf) to smoothly estimate heterogeneous treatment effects over gestational age using observational data from early medical abortion regimens.
2) The tsbcf method extends Bayesian additive regression trees (BART) to estimate treatment effects that evolve smoothly over gestational age, while allowing for heterogeneous effects across patient subgroups.
3) The tsbcf analysis of early medical abortion regimen data found the simultaneous administration to be similarly effective overall to the interval administration, but identified some patient subgroups where effectiveness may vary more over gestational age.
Difference-in-differences is a widely used evaluation strategy that draws causal inference from observational panel data. Its causal identification relies on the assumption of parallel trends, which is scale-dependent and may be questionable in some applications. A common alternative is a regression model that adjusts for the lagged dependent variable, which rests on the assumption of ignorability conditional on past outcomes. In the context of linear models, Angrist and Pischke (2009) show that the difference-in-differences and lagged-dependent-variable regression estimates have a bracketing relationship. Namely, for a true positive effect, if ignorability is correct, then mistakenly assuming parallel trends will overestimate the effect; in contrast, if the parallel trends assumption is correct, then mistakenly assuming ignorability will underestimate the effect. We show that the same bracketing relationship holds in general nonparametric (model-free) settings. We also extend the result to semiparametric estimation based on inverse probability weighting.
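A small simulation makes the bracketing direction concrete. The data-generating process below is invented for illustration: ignorability conditional on the lagged outcome holds and treated units start lower, so DiD should overshoot the true effect while the lagged-dependent-variable regression recovers it.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5000
g = rng.random(n) < 0.5                       # treated-group indicator
y0 = rng.normal(size=n) - 1.0 * g             # pre-period: treated start lower
tau = 1.0                                     # true treatment effect
y1 = 0.8 * y0 + 0.5 + tau * g + rng.normal(size=n)   # post-period outcome

# Difference-in-differences: between-group gap in (post - pre) changes.
did = (y1[g] - y0[g]).mean() - (y1[~g] - y0[~g]).mean()

# Lagged-dependent-variable regression: y1 on (1, g, y0); correctly
# specified here, since ignorability holds conditional on y0.
Z = np.column_stack([np.ones(n), g.astype(float), y0])
beta, *_ = np.linalg.lstsq(Z, y1, rcond=None)
print(f"DiD ~ {did:.2f} (overshoots), LDV ~ {beta[1]:.2f}, truth = {tau}")
```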
We develop sensitivity analyses for weak nulls in matched observational studies while allowing unit-level treatment effects to vary. In contrast to randomized experiments and paired observational studies, we show for general matched designs that over a large class of test statistics, any valid sensitivity analysis for the weak null must be unnecessarily conservative if Fisher's sharp null of no treatment effect for any individual also holds. We present a sensitivity analysis valid for the weak null, and illustrate why it is conservative if the sharp null holds through connections to inverse probability weighted estimators. An alternative procedure is presented that is asymptotically sharp if treatment effects are constant, and is valid for the weak null under additional assumptions which may be deemed reasonable by practitioners. The methods may be applied to matched observational studies constructed using any optimal without-replacement matching algorithm, allowing practitioners to assess robustness to hidden bias while allowing for treatment effect heterogeneity.
This document discusses difference-in-differences (DiD) analysis, a quasi-experimental method used to estimate treatment effects. The author notes that while widely applicable, DiD relies on strong assumptions about the counterfactual. She recommends approaches like matching on observed variables between similar populations, thoughtfully specifying regression models to adjust for confounding factors, testing for parallel pre-treatment trends under different assumptions, and considering more complex models that allow for different types of changes over time. The overall message is that DiD requires careful consideration and testing of its underlying assumptions to draw valid causal conclusions.
We present recent advances and statistical developments for evaluating Dynamic Treatment Regimes (DTR), which allow the treatment to be dynamically tailored according to evolving subject-level data. Identification of an optimal DTR is a key component for precision medicine and personalized health care. Specific topics covered in this talk include several recent projects with robust and flexible methods developed for the above research area. We will first introduce a dynamic statistical learning method, adaptive contrast weighted learning (ACWL), which combines doubly robust semiparametric regression estimators with flexible machine learning methods. We will further develop a tree-based reinforcement learning (T-RL) method, which builds an unsupervised decision tree that maintains the nature of batch-mode reinforcement learning. Unlike ACWL, T-RL handles the optimization problem with multiple treatment comparisons directly through a purity measure constructed with augmented inverse probability weighted estimators. T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs. However, ACWL seems more robust against tree-type misspecification than T-RL when the true optimal DTR is non-tree-type. At the end of this talk, we will also present a new Stochastic-Tree Search method called ST-RL for evaluating optimal DTRs.
A fundamental feature of evaluating causal health effects of air quality regulations is that air pollution moves through space, rendering health outcomes at a particular population location dependent upon regulatory actions taken at multiple, possibly distant, pollution sources. Motivated by studies of the public-health impacts of power plant regulations in the U.S., this talk introduces the novel setting of bipartite causal inference with interference, which arises when 1) treatments are defined on observational units that are distinct from those at which outcomes are measured and 2) there is interference between units in the sense that outcomes for some units depend on the treatments assigned to many other units. Interference in this setting arises due to complex exposure patterns dictated by physical-chemical atmospheric processes of pollution transport, with intervention effects framed as propagating across a bipartite network of power plants and residential zip codes. New causal estimands are introduced for the bipartite setting, along with an estimation approach based on generalized propensity scores for treatments on a network. The new methods are deployed to estimate how emission-reduction technologies implemented at coal-fired power plants causally affect health outcomes among Medicare beneficiaries in the U.S.
Laine Thomas presented on how causal inference is being used to determine the costs and benefits of the two most common surgical treatments for women: hysterectomy and myomectomy.
We provide an overview of some recent developments in machine learning tools for dynamic treatment regime discovery in precision medicine. The first development is a new off-policy reinforcement learning tool for continual learning in mobile health to enable patients with type 1 diabetes to exercise safely. The second development is a new inverse reinforcement learning tool that enables the use of observational data to learn how clinicians balance competing priorities when treating depression and mania in patients with bipolar disorder. Both practical and technical challenges are discussed.
The method of differences-in-differences (DID) is widely used to estimate causal effects. The primary advantage of DID is that it can account for time-invariant bias from unobserved confounders. However, the standard DID estimator will be biased if there is an interaction between history in the after period and the groups. That is, bias will be present if an event besides the treatment occurs at the same time and affects the treated group in a differential fashion. We present a method of bounds based on DID that accounts for an unmeasured confounder that has a differential effect in the post-treatment time period. These DID bracketing bounds are simple to implement and only require partitioning the controls into two separate groups. We also develop two key extensions for DID bracketing bounds. First, we develop a new falsification test to probe the key assumption that is necessary for the bounds estimator to provide consistent estimates of the treatment effect. Next, we develop a method of sensitivity analysis that adjusts the bounds for possible bias based on differences between the treated and control units from the pretreatment period. We apply these DID bracketing bounds and the new methods we develop to an application on the effect of voter identification laws on turnout. Specifically, we focus estimating whether the enactment of voter identification laws in Georgia and Indiana had an effect on voter turnout.
This document summarizes a simulation study evaluating causal inference methods for assessing the effects of opioid and gun policies. The study used real US state-level data to simulate the adoption of policies by some states and estimated the effects using different statistical models. It found that with fewer adopting states, type 1 error rates were too high, and most models lacked power. It recommends using cluster-robust standard errors and lagged outcomes to improve model performance. The study aims to help identify best practices for policy evaluation studies.
We study experimental design in large-scale stochastic systems with substantial uncertainty and structured cross-unit interference. We consider the problem of a platform that seeks to optimize supply-side payments p in a centralized marketplace where different suppliers interact via their effects on the overall supply-demand equilibrium, and propose a class of local experimentation schemes that can be used to optimize these payments without perturbing the overall market equilibrium. We show that, as the system size grows, our scheme can estimate the gradient of the platform’s utility with respect to p while perturbing the overall market equilibrium by only a vanishingly small amount. We can then use these gradient estimates to optimize p via any stochastic first-order optimization method. These results stem from the insight that, while the system involves a large number of interacting units, any interference can only be channeled through a small number of key statistics, and this structure allows us to accurately predict feedback effects that arise from global system changes using only information collected while remaining in equilibrium.
We discuss a general roadmap for generating causal inference from observational studies used to generate real-world evidence. We review targeted minimum loss estimation (TMLE), which provides a general template for constructing asymptotically efficient plug-in estimators of a target estimand for realistic (i.e., infinite-dimensional) statistical models. TMLE is a two-stage procedure whose first stage uses ensemble machine learning, termed super-learning, to estimate the relevant stochastic relations between the treatment, censoring, covariates, and outcome of interest. The super-learner allows one to fully utilize advances in machine learning (in addition to more conventional parametric-model-based estimators) to build a single, most powerful ensemble learning algorithm. We present the Highly Adaptive Lasso as an important machine learning algorithm to include.
In the second stage, TMLE maximizes a parametric likelihood along a so-called least favorable parametric submodel through the super-learner fit of the relevant stochastic relations in the observed data. This second stage bridges the state of the art in machine learning to estimators of target estimands for which statistical inference is available (i.e., confidence intervals, p-values, etc.). We also review recent advances in collaborative TMLE, in which the fit of the treatment and censoring mechanisms is tailored to the performance of the TMLE, and we discuss asymptotically valid bootstrap-based inference. Simulations and data analyses are provided as demonstrations.
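To make the two-stage template concrete, the following is a minimal sketch of TMLE for the average treatment effect with a binary outcome. It substitutes a plain logistic regression for the super-learner in stage one and omits censoring; all names (tmle_ate, W, A, Y) are hypothetical, and this is a sketch of the template rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from scipy.special import expit, logit
from scipy.optimize import minimize_scalar

def tmle_ate(W, A, Y):
    """Minimal TMLE sketch for the ATE with binary Y (no censoring).
    Stage 1 uses plain logistic regressions where the talk uses super-learning."""
    # Stage 1: initial outcome regression Q(A, W) and propensity score g(W).
    Q_fit = LogisticRegression(max_iter=1000).fit(np.column_stack([A, W]), Y)
    g = np.clip(LogisticRegression(max_iter=1000).fit(W, A).predict_proba(W)[:, 1],
                0.025, 0.975)
    Q1 = np.clip(Q_fit.predict_proba(np.column_stack([np.ones_like(A), W]))[:, 1],
                 1e-6, 1 - 1e-6)
    Q0 = np.clip(Q_fit.predict_proba(np.column_stack([np.zeros_like(A), W]))[:, 1],
                 1e-6, 1 - 1e-6)
    QA = np.where(A == 1, Q1, Q0)
    # Stage 2: fluctuate along the least favorable parametric submodel,
    # logit Q_eps = logit Q + eps * H, with "clever covariate" H.
    H = A / g - (1 - A) / (1 - g)
    def nll(e):
        p = np.clip(expit(logit(QA) + e * H), 1e-9, 1 - 1e-9)
        return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))
    eps = minimize_scalar(nll, bounds=(-2, 2), method="bounded").x  # 1-dim MLE
    # Targeted predictions and plug-in ATE.
    Q1_star = expit(logit(Q1) + eps / g)
    Q0_star = expit(logit(Q0) - eps / (1 - g))
    return np.mean(Q1_star - Q0_star)
```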
Climate change mitigation has traditionally been analyzed as some version of a public goods game (PGG) in which a group is most successful if everybody contributes, but players are best off individually by not contributing anything (i.e., “free-riding”)—thereby creating a social dilemma. Analysis of climate change using the PGG and its variants has helped explain why global cooperation on GHG reductions is so difficult, as nations have an incentive to free-ride on the reductions of others. Rather than inspire collective action, it seems that the lack of progress in addressing the climate crisis is driving the search for a “quick fix” technological solution that circumvents the need for cooperation.
This document discusses various types of academic writing and provides tips for effective academic writing. It outlines common academic writing formats such as journal papers, books, and reports. It also lists writing necessities like having a clear purpose, understanding your audience, using proper grammar and being concise. The document cautions against plagiarism and not proofreading. It provides additional dos and don'ts for writing, such as using simple language and avoiding filler words. Overall, the key message is that academic writing requires selling your ideas effectively to the reader.
Machine learning (including deep and reinforcement learning) and blockchain are two of the most prominent technologies of recent years. The first is the foundation of artificial intelligence and big data; the second has significantly disrupted the financial industry. Both technologies are data-driven, so there is rapidly growing interest in integrating them for more secure and efficient data sharing and analysis. In this paper, we review research on combining blockchain and machine learning and demonstrate that they can collaborate efficiently and effectively. We close by pointing out some future directions and anticipating more research on deeper integration of these two promising technologies.
In this talk, we discuss QuTrack, a Blockchain-based approach to track experiment and model changes primarily for AI and ML models. In addition, we discuss how change analytics can be used for process improvement and to enhance the model development and deployment processes.
This talk builds on recent empirical work addressing the extent to which the transaction graph serves as an early-warning indicator for large financial losses. By identifying certain sub-graphs ('chainlets') with a causal effect on price movements, we demonstrate the impact of extreme transaction-graph activity on the intraday volatility of the Bitcoin price series. In particular, we infer the loss distributions conditional on extreme chainlet activity. Armed with this empirical representation, we propose a modeling approach to explore conditions under which the market is stabilized by transaction-graph-aware agents.
8–9. Literature review
Post-selection inference
Uniform inference: Berk et al. (2013), Bachoc et al. (2016), Kuchibhotla et al. (2018)
Data splitting: Rinaldo et al. (2016), Fithian et al. (2014)
Selective (conditional) inference: Lee et al. (2016), Zhao et al. (2017), Tian and Taylor (2018)
Commonality of these different approaches: a data-dependent target β_M̂.
In this talk: the structural parameter α as target.
12. Key points of the talk
Refitting approach
α̂_refit is biased: it has both over-fitting and under-fitting components.
We provide statistical insight into this bias.
We develop a repeated data splitting procedure to remove the bias.
Cross-fitting is not as efficient as repeated data splitting.
14–19. High-dimensional approximately linear model
Model setup
Y = αD + Xβ + R_n + ε,  E(ε | D, X) = 0.
α: parameter of interest
D: treatment or variable of interest
X: high-dimensional covariates (e.g., basis functions for nonparametric regression functions)
ε: noise
β: sparse vector of coefficients, i.e., M₀ = {j : β_j ≠ 0, j = 1, …, p}, |M₀| = s₀ ≪ p
R_n: approximation error
Under the Neyman-Rubin causal model and the unconfoundedness assumption, α is the causal effect.
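For concreteness, here is a minimal data-generating sketch for this model. The design choices are illustrative assumptions of ours, not the talk's, and the approximation error R_n is set to zero; the sketches further below reuse this function.

```python
import numpy as np

def simulate(n=200, p=300, s0=5, alpha=1.0, seed=0):
    """Draw one sample from Y = alpha*D + X beta + eps with a sparse beta.
    Illustrative design; the approximation error R_n is set to zero."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:s0] = 1.0                     # support M0 = {1, ..., s0}, s0 << p
    D = 0.3 * X[:, :s0].sum(axis=1) + rng.standard_normal(n)  # D correlated with confounders
    Y = alpha * D + X @ beta + rng.standard_normal(n)
    return Y, D, X, beta
```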
20–22. Common perception and challenges for inference after refitting
A common perception
Inference after refitting is valid, because many model selection methods satisfy the "oracle property" (Fan and Li, 2001):
lim_{n→∞} P(M̂ = M₀) = 1.
Challenges
The "oracle property" requires stringent assumptions.
Perfect model selection does not happen with high probability in finite samples.
42. Percentage of perfect model selection vs. model size
Perfect model selection never happens with high probability.
43–46. Summary: Refitting bias of M̂
α̂_refit − α = e₁ᵀ(Z_M̂ᵀ Z_M̂)⁻¹ Z_M̂ᵀ ε  [over-fitting]  +  (Dᵀ(I − P_M̂)D)⁻¹ Dᵀ(I − P_M̂)Xβ  [under-fitting],
where Z_M̂ = (D, X_M̂) and P_M̂ = X_M̂(X_M̂ᵀ X_M̂)⁻¹ X_M̂ᵀ.
Over-fitting and under-fitting bias
If M̂ ⊂ M₀, α̂_refit has under-fitting bias (omitted-variable bias).
If M₀ ⊂ M̂, α̂_refit has over-fitting bias due to spurious correlation (Fan):
E(α̂_refit − α) = E[ e₁ᵀ(Z_M̂ᵀ Z_M̂)⁻¹ Z_M̂ᵀ E(ε | Z_M̂) ].
Over- and under-fitting bias may occur simultaneously.
Hong et al. (2018) and Chernozhukov et al. (2018) discussed a similar bias issue.
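The decomposition is an exact algebraic identity, which the following sketch verifies numerically on data from the simulate() sketch above, with cross-validated lasso as a stand-in selection step (not necessarily the talk's selector).

```python
import numpy as np
from sklearn.linear_model import LassoCV

alpha_true = 1.0
Y, D, X, beta = simulate(alpha=alpha_true, seed=1)   # sketch above
eps = Y - alpha_true * D - X @ beta                  # noise, known by construction

# Stand-in selection on the full data: lasso on (D, X); D is always kept.
coef = LassoCV(cv=5).fit(np.column_stack([D, X]), Y).coef_
M_hat = np.flatnonzero(coef[1:])
X_M = X[:, M_hat]
Z_M = np.column_stack([D, X_M])

# Over-fitting term: e1' (Z_M' Z_M)^{-1} Z_M' eps.
overfit = np.linalg.solve(Z_M.T @ Z_M, Z_M.T @ eps)[0]
# Under-fitting term: (D'(I - P_M)D)^{-1} D'(I - P_M) X beta, via residualized D.
D_res = D - X_M @ np.linalg.solve(X_M.T @ X_M, X_M.T @ D)
underfit = (D_res @ (X @ beta)) / (D_res @ D)

alpha_refit = np.linalg.solve(Z_M.T @ Z_M, Z_M.T @ Y)[0]
print(alpha_refit - alpha_true, overfit + underfit)  # identical up to rounding
```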
47–51. Removing the over-fitting bias by data splitting
Suppose that M₀ ⊂ M̂. Then the refitted estimator simplifies to
α̂_refit − α = e₁ᵀ(Z_M̂ᵀ Z_M̂)⁻¹ Z_M̂ᵀ ε.
Remove the over-fitting bias by data splitting (Mosteller and Tukey, 1977): split the sample into two parts T₁ and T₂, select the model M̂ on T₁, and refit on T₂ (see the sketch below).
On T₂, the over-fitting bias vanishes since E(ε_T₂ | Z_M̂) = 0.
Data splitting removes the over-fitting bias, but it increases the estimation variability.
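Here is a minimal sketch of one such split, again with cross-validated lasso as a stand-in selector; the names (split_refit, T1, T2) are hypothetical, and an equal split is one choice among many.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def split_refit(Y, D, X, rng):
    """One data split: select the model on T1, refit by OLS on T2.
    The refit on T2 carries no over-fitting bias since E(eps_T2 | Z_M) = 0."""
    n = len(Y)
    idx = rng.permutation(n)
    T1, T2 = idx[: n // 2], idx[n // 2:]
    coef = LassoCV(cv=5).fit(np.column_stack([D[T1], X[T1]]), Y[T1]).coef_
    M_hat = np.flatnonzero(coef[1:])                # selection uses T1 only
    Z2 = np.column_stack([D[T2], X[T2][:, M_hat]])  # refit design on T2
    alpha_hat = np.linalg.solve(Z2.T @ Z2, Z2.T @ Y[T2])[0]
    return alpha_hat, M_hat, T2
```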
56–59. R-Split: Repeated Data Splitting
On each split k, the estimate α̂_k depends on the data and on the random subsample indices.
In theory, B → ∞ and
α̃ = E(α̂_k | Data).
In practice, B is a large number, e.g., B = 1000.
R-Split is similar to bagging (Breiman, 1996).
Sub-samples for both estimation and model selection are random and can overlap across splits.
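A minimal R-Split sketch, aggregating the split_refit() sketch above over B random splits; it also records which observations entered each refit, which the variance-estimation sketch at the end of the document consumes.

```python
import numpy as np

def r_split(Y, D, X, B=1000, seed=0):
    """Average the split estimates over B random splits, approximating
    alpha_tilde = E(alpha_hat_k | Data); also record refit indicators V."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    alphas = np.empty(B)
    V = np.zeros((B, n))        # V[b, j] = 1 if obs. j is refit data in split b
    for b in range(B):
        alphas[b], _, T2 = split_refit(Y, D, X, rng)  # sketch above
        V[b, T2] = 1.0
    return alphas.mean(), alphas, V
```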
64–66. Cross-fitting vs. R-Split
α̂_cv − α = (1/2)( e₁ᵀΣ_M̂₁⁻¹ I_M̂₁ + e₁ᵀΣ_M̂₂⁻¹ I_M̂₂ ) (1/n) Σ_{i=1}^n ε_i Z_i + o_p(1/√n),
where I_M̂ denotes the embedding of the coordinates in M̂ (cf. Assumption 1 below).
Variance decomposition of α̂_cv
Var(α̂_cv − α) = E[ Var( (1/2)( e₁ᵀΣ_M̂₁⁻¹ I_M̂₁ + e₁ᵀΣ_M̂₂⁻¹ I_M̂₂ ) (1/n) Σ_{i=1}^n ε_i Z_i | Data ) ] + Var(α̃ − α) ≥ Var(α̃ − α),
where the second term, the variance of the conditional expectation given the data, is the variance of R-Split.
If M̂₁ = M̂₂ = M₀, then Var(α̂_cv − α) = Var(α̃ − α).
R-Split reduces the variance by aggregating over all possible random models.
67. R-Split: Asymptotic Normality
Theorem (R-Split). Under certain assumptions, the R-Split estimator has the linear representation
α̃ − α = η_nᵀ (1/n) Σ_{i=1}^n ε_i Z_i + o_p(1/√n),
and thus
σ_n⁻¹ √n (α̃ − α) ⇝ N(0, 1),
with σ_n = σ_ε (η_nᵀ Σ_n η_n)^{1/2}, Σ_n = ZᵀZ/n, and Z = (D, X).
68–73. R-Split: Regularity assumptions
Assumption 1 (Characterization of η_n). There exists a random vector η_n ∈ R^{p+1}, independent of ε, satisfying
‖ E[ P(e₁ᵀΣ_M̂⁻¹) | Data ] − η_n ‖₁ = o_p(1/log p),
where P : R^{|M̂|} → R^{p+1} is an embedding that sparsifies a vector.
Special case: suppose M̂ = M₀ for all splits. Then
η_{n,j} = (e₁ᵀΣ_M₀⁻¹)_j if j ∈ M₀, and η_{n,j} = 0 otherwise,
and therefore
α̃ − α = e₁ᵀΣ_M₀⁻¹ (1/n) Σ_{i=1}^n ε_i Z_{i,M₀} + o_p(1/√n).
For the fixed model M₀, α̃ reduces to OLS based on the full sample; our theory generalizes OLS from fixed to random models.
Assumption 2 (Negligible under-fitting bias). The under-fitting bias is negligible after averaging over all splits.
Assumption 3 ("Robust" model selection procedure). The distribution of M̂ remains stable if only one out of n observations changes.
Assumption 4 (Sparsity level). The selected model sizes are of the same order as s₀, and s₀ = o(n).
74–75. Conclusion
Refitting approach
The bias of α̂_refit is composed of two parts: under-fitting and over-fitting.
R-Split (repeated data splitting) removes the over-fitting bias without much sacrifice of efficiency.
R-Split is more efficient than cross-fitting.
Jingshen Wang, Xuming He, and Gongjun Xu. Debiased inference on treatment effect in a high-dimensional model. Journal of the American Statistical Association, 2019.
76. References
François Bachoc, David Preinerstorfer, and Lukas Steinberger. Uniformly valid confidence intervals post-model-selection. arXiv preprint arXiv:1611.01043, 2016.
Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.
Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
Sougata Chaudhuri, Abraham Bagherjeiran, and James Liu. Ranking and calibrating click-attributed purchases in performance display advertising. In Proceedings of the ADKDD'17, page 7. ACM, 2017.
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
Bradley Efron. Estimation and accuracy after model selection. Journal of the American Statistical Association, 109(507):991–1007, 2014.
Robert F. Engle, Clive W. J. Granger, John Rice, and Andrew Weiss. Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81(394):310–320, 1986.
Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
Liang Hong, Todd A. Kuffner, and Ryan Martin. On overfitting and post-selection uncertainty assessments. Biometrika, 105(1):221–224, 2018.
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, and Linda Zhao. A model free perspective for linear regression: Uniform-in-model bounds for post selection inference. arXiv preprint arXiv:1802.05801, 2018.
Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
Frederick Mosteller and John Wilder Tukey. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods, 1977.
Max Pashkevich, Sundar Dorai-Raj, Melanie Kellar, and Dan Zigmond. Empowering online advertisements by empowering viewers with the right to choose: The relative effectiveness of skippable video advertisements on YouTube. Journal of Advertising Research, 52(4):451–457, 2012.
Alessandro Rinaldo, Larry Wasserman, Max G'Sell, Jing Lei, and Ryan Tibshirani. Bootstrapping and sample splitting for high-dimensional, assumption-free inference. arXiv preprint arXiv:1611.05401, 2016.
Xiaoying Tian and Jonathan Taylor. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.
Allen J. Wilcox and Ian T. Russell. Birthweight and perinatal mortality: I. On the frequency distribution of birthweight. International Journal of Epidemiology, 12(3):314–318, 1983.
J. Yerushalmy. The relationship of parents' cigarette smoking to outcome of pregnancy: Implications as to the problem of inferring causation from observed associations. American Journal of Epidemiology, 93(6):443–443, 1971.
Qingyuan Zhao, Dylan S. Small, and Ashkan Ertefaie. Selective inference for effect modification via the lasso. arXiv preprint arXiv:1705.08020, 2017.
78. R-Split: estimation of the variance
Estimator of the variance of α̃
By the non-parametric delta method, we have
σ̂_n² = n Σ_{j=1}^n [ ((n − 1)/(n − n₂)) B⁻¹ Σ_{b=1}^B (v_bj − B⁻¹ Σ_{k=1}^B v_kj) α̂_b ]²  (approximation of the squared influence function)
      − (n₂ n / (B²(n − n₂))) Σ_{b=1}^B (α̂_b − α̃)²  (finite-B bias correction),
where
B: the number of repeated data splits;
n₂: the size of the sample used for refitting;
v_bj = 1 if the j-th observation is used for refitting in the b-th sub-sample, and 0 otherwise.
Note: this is a generalization of the nonparametric delta method for bootstrapping in Efron (2014).
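A sketch of this estimator, consuming the alphas and V produced by the r_split() sketch above; this is our reconstruction of the displayed formula, so treat the constants as approximate.

```python
import numpy as np

def r_split_variance(alphas, V, n):
    """Delta-method variance estimate for alpha_tilde (Efron-2014 style)."""
    B = len(alphas)
    n2 = int(V[0].sum())                         # refit subsample size
    # cov_j = B^{-1} sum_b (v_bj - vbar_j) * alpha_hat_b, for each observation j.
    cov_j = ((V - V.mean(axis=0)) * alphas[:, None]).mean(axis=0)
    main = n * np.sum((((n - 1) / (n - n2)) * cov_j) ** 2)   # squared influence part
    bias = n2 * n / (B ** 2 * (n - n2)) * np.sum((alphas - alphas.mean()) ** 2)
    return (main - bias) / n                     # Var(alpha_tilde) ~ sigma_n^2 / n

# Hypothetical usage, combining the sketches above; by the asymptotic normality
# theorem, a 95% confidence interval follows directly:
# alpha_tilde, alphas, V = r_split(Y, D, X, B=1000)
# se = np.sqrt(r_split_variance(alphas, V, len(Y)))
# ci = (alpha_tilde - 1.96 * se, alpha_tilde + 1.96 * se)
```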