- Bayesian adjustment for confounding (BAC) in Bayesian propensity score estimation accounts for uncertainty in propensity score modeling and model selection.
- A prognostic score model is used to inform a prior on propensity score model selection, favoring inclusion of true confounders and exclusion of instruments.
- Simulation results found the informative prior was not able to adequately shape model selection; a penalty term was proposed to make the prior more influential.
- With the penalty term, the informative prior influenced inclusion of instruments in propensity score models without distorting inclusion of other variables.
Linear Discriminant Analysis (LDA) Under f-Divergence Measures - Anmol Dwivedi
For more details, please have a look at:
1. https://www.mdpi.com/1099-4300/24/2/188
2. https://ieeexplore.ieee.org/document/9518004
Abstract:
In statistical inference, the information-theoretic performance limits can often be expressed in terms of a notion of divergence between the underlying statistical models (e.g., in binary hypothesis testing, the total error probability is equal to the total variation between the models). As the data dimension grows, computing the statistics involved in decision-making and the attendant performance limits (divergence measures) face complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising the performance (divergence reduces due to the data processing inequality for divergence). This paper considers linear dimensionality reduction such that the divergence between the models is \emph{maximally} preserved. Specifically, the paper focuses on the Gaussian models and characterizes an optimal projection of the data onto a lower-dimensional subspace with respect to four $f$-divergence measures (Kullback-Leibler, $\chi^2$, Hellinger, and total variation). There are two key observations. First, projections are not necessarily along the dominant modes of the covariance matrix of the data, and even in some situations, they can be along the least dominant modes. Secondly, under specific regimes, the optimal design of subspace projection is identical under all the $f$-divergence measures considered, rendering a degree of universality to the design independent of the inference problem of interest.
Non-sampling functional approximation of linear and non-linear Bayesian Update - Alexander Litvinenko
We offer a non-sampling functional approximation of a non-linear surrogate to the classical Bayesian update formula. We start with a prior Polynomial Chaos Expansion (PCE), express the log-likelihood in a PCE basis, and obtain a new posterior PCE.
The main idea is to update not the probability density but the basis coefficients.
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...Yandex
We consider a new class of huge-scale problems, the problems with sparse subgradients. The most important functions of this type are piecewise linear. For optimization problems with uniform sparsity of the corresponding linear operators, we suggest a very efficient implementation of subgradient iterations, whose total cost depends logarithmically on the dimension. This technique is based on a recursive update of the results of matrix/vector products and the values of symmetric functions. It works well, for example, for matrices with few nonzero diagonals and for max-type functions.
We show that the updating technique can be efficiently coupled with the simplest subgradient methods. Similar results can be obtained for a new non-smooth random variant of a coordinate descent scheme. We also present promising results of preliminary computational experiments.
How to find a cheap surrogate to approximate Bayesian Update Formula and to a... - Alexander Litvinenko
We suggest a new vision of the classical Bayesian update formula. We expand all ingredients in a Polynomial Chaos Expansion (PCE) and write out a new formula for the Bayesian update of the PCE coefficients. The formula is derived from minimum mean square estimation: one starts with a prior PCE, takes measurements into account, and obtains the posterior PCE coefficients, without any MCMC sampling.
Probabilistic Control of Switched Linear Systems with Chance Constraints - Leo Asselborn
This presentation proposes an approach to algorithmically synthesize control strategies for set-to-set transitions of uncertain discrete-time switched linear systems, based on a combination of tree search and reachable-set computations in a stochastic setting. The initial state and disturbances are assumed to be Gaussian distributed, and a time-variant hybrid control law stabilizes the system towards a goal set. The algorithmic solution computes sequences of discrete states via tree search, and the continuous controls are obtained by solving embedded semi-definite programs (SDPs). These programs take polytopic input constraints as well as time-varying probabilistic state constraints into account. An example demonstrating the principles of the solution procedure, with focus on handling the chance constraints, is included.
• Treatment regimes for a single decision point (potential outcomes, value)
• Estimation of the value of a fixed regime (identifiability assumptions, outcome regression estimator, IPW/AIPW estimators)
• Characterization of an optimal regime (in terms of potential outcomes, observed data)
• Estimation of an optimal regime (regression, A-learning, direct search IPW/AIPW, nonregularity)
Solving inverse problems via non-linear Bayesian update of PCE coefficients - Alexander Litvinenko
We derive a non-linear approximation of the Bayesian update for PCE coefficients, avoiding Markov chain Monte Carlo sampling to compute the posterior.
Minimum mean square error estimation and approximation of the Bayesian update - Alexander Litvinenko
We develop a Bayesian update surrogate. In contrast to the classical Bayesian approach, our formula updates the polynomial chaos (PCE) coefficients directly. We show that the classical Kalman filter is a particular case of our update.
Tutorial on Belief Propagation in Bayesian Networks - Anmol Dwivedi
The goal of this mini-project is to implement belief propagation algorithms for posterior probability inference and most probable explanation (MPE) inference in a Bayesian network with binary variables, in which the conditional probability table for each random variable/node is given.
Connection between inverse problems and uncertainty quantification problems
GonzalezGinestetResearchDay2016
1. Bayesian Adjustment for Confounding (BAC) in Bayesian Propensity Score Estimation
Pablo Gonzalez Ginestet
McGill University, CNODES, Lady Davis Institute
pablo.gonzalezginestet@mail.mcgill.ca
12th Annual EBOH Research Day, Montreal, April 2016
2. Outline
1 Traditional PS Estimation
2 Bayesian PS Estimation
Uncertainty regarding the PS
Uncertainty regarding the PS + model uncertainty
3 BAC in Bayesian PS & Results
First Stage
Second Stage
4 Conclusion
3-5. Traditional PS Estimation
It is a sequential process:
1 "PS stage": the propensity score $PS = P(X_i = 1 \mid C_i)$ is estimated:
$$\mathrm{logit}(PS) = \sum_{k=1}^{p} \gamma_k C_{k,i}$$
2 "Outcome stage": the causal effect is estimated adjusting for $\widehat{PS}(\hat{\gamma})$:
$$\mathrm{logit}(E[Y_i \mid X_i, C_i]) = \beta_0 + \beta_X X_i + \xi h(\widehat{PS})$$
Remark 1. The outcome stage treats $\widehat{PS}$ as fixed and known ⇒ it ignores the uncertainty regarding the PS.
Remark 2. It ignores model uncertainty regarding the selection of confounders for the PS.
The set of covariates is fixed ⇒ $\mathcal{M}_{PS} = \{\text{one model}\}$
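The sequential procedure above can be sketched in a few lines. This is an illustrative toy example, not the presenter's code: the data-generating values, the use of scikit-learn's `LogisticRegression`, and the choice $h(PS) = PS$ are assumptions made here for concreteness.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: C1 is a confounder, C2 an instrument (illustrative values).
N = 2000
C = rng.normal(size=(N, 2))
X = rng.binomial(1, 1 / (1 + np.exp(-(0.6 * C[:, 0] + 0.6 * C[:, 1]))))
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.1 * X + 0.6 * C[:, 0]))))

# "PS stage": logit(PS) = sum_k gamma_k C_{k,i} (no intercept, as on the slide).
ps_model = LogisticRegression(fit_intercept=False).fit(C, X)
ps_hat = ps_model.predict_proba(C)[:, 1]

# "Outcome stage": adjust for the *estimated* PS, here with h(PS) = PS.
design = np.column_stack([X, ps_hat])
out_model = LogisticRegression().fit(design, Y)
beta_X = out_model.coef_[0][0]
```

Note that the second stage treats `ps_hat` as fixed and known, which is exactly the criticism in Remark 1.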
6. Uncertainty regarding the PS
Bayesian PS Estimation (McCandless, Gustafson and Austin 2009, and Zigler et al. 2013)
Bayesian PS estimates the PS stage and outcome stage simultaneously:
"PS stage":
$$\mathrm{logit}(PS) = \sum_{k=1}^{p} \gamma_k C_{k,i}$$
"Outcome stage":
$$\mathrm{logit}(E[Y_i \mid X_i, C_i]) = \beta_0 + \beta_X X_i + \xi h(PS) + \sum_{k=1}^{p} \delta_k C_{k,i}$$
The set of covariates is fixed ⇒ $\mathcal{M}_{PS} = \{\text{one model}\}$
7-8. Uncertainty regarding the PS + model uncertainty
Bayesian PS Estimation (Zigler and Dominici 2014)
Bayesian PS estimates the PS stage and outcome stage simultaneously:
"PS stage":
$$\mathrm{logit}(PS) = \sum_{k=1}^{p} \alpha_k^{x|c} \gamma_k C_{k,i}$$
"Outcome stage":
$$\mathrm{logit}(E[Y_i \mid X_i, C_i]) = \beta_0 + \beta_X X_i + \xi h(PS) + \sum_{k=1}^{p} \alpha_k^{x|c} \delta_k C_{k,i}$$
The set of covariates is NOT fixed ⇒ $\mathcal{M}_{PS} = \{\text{all possible models}\}$
Posterior distribution of the ACE:
$$p(\mathrm{ACE} \mid \text{data}) \approx \sum_{\alpha^{x|c} \in \mathcal{M}_{PS}} p(\mathrm{ACE}_{\alpha^{x|c}} \mid \alpha^{x|c}, \text{data}) \, p(\alpha^{x|c} \mid \text{data})$$
9-11. Remarks, Motivation & Goal
Remark 1. Uninformative prior: each model has equal prior probability, $p(\alpha^{x|c}) = \frac{1}{|\mathcal{M}_{PS}|}$.
Remark 2. Most of the time, instrumental variables (IVs) are included in the PS model.
Goal: to limit the selection of IVs.
Strategy: an informative prior on the PS model indicator $\alpha^{x|c}$.
12. Illustrative Example
We simulate 250 replicated data sets with $N = 1000$ and $p = 7$ covariates:
$\{C_1, C_2, C_3\}$ true confounders; $\{C_4\}$ a risk factor for the outcome; $\{C_5, C_6\}$ IVs; and $\{C_7\}$ a noise variable.
$\mathrm{ACE} = P(Y = 1 \mid X = 1) - P(Y = 1 \mid X = 0) = 0.06$
13-14. First Stage
BAC in Bayesian PS
Prognostic score model (based on a single treatment group; Hansen 2008):
$$\mathrm{logit}(E[Y_{i,0} \mid C_i]) = \sum_{k=1}^{p} \alpha_k^{y|c} \eta_k C_{k,i}$$
Prior distribution on $\alpha^{x|c} \mid \alpha^{y|c}$ following Wang et al. (2012):
$$\frac{p(\alpha_k^{x|c} = 1 \mid \alpha_k^{y|c} = 0)}{p(\alpha_k^{x|c} = 0 \mid \alpha_k^{y|c} = 0)} = \frac{1}{\omega}, \qquad \frac{p(\alpha_k^{x|c} = 1 \mid \alpha_k^{y|c} = 1)}{p(\alpha_k^{x|c} = 0 \mid \alpha_k^{y|c} = 1)} = 1$$
15. The above constraints imply the following:
$$P(\alpha_k^{x|c} = 0 \mid \alpha_k^{y|c} = 0) = \frac{\omega}{1 + \omega}, \qquad P(\alpha_k^{x|c} = 1 \mid \alpha_k^{y|c} = 0) = \frac{1}{1 + \omega}$$
$$P(\alpha_k^{x|c} = 0 \mid \alpha_k^{y|c} = 1) = P(\alpha_k^{x|c} = 1 \mid \alpha_k^{y|c} = 1) = \frac{1}{2}$$
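These four probabilities are easy to encode and sanity-check against the two odds constraints. A minimal sketch (the function name and dictionary layout are ours, not from the slides):

```python
def ps_inclusion_prior(omega):
    """p(alpha_k^{x|c} = x | alpha_k^{y|c} = y) implied by the two
    odds constraints on the slides, keyed by (x, y)."""
    return {
        (0, 0): omega / (1 + omega),  # exclude from PS model | not in prognostic model
        (1, 0): 1 / (1 + omega),      # include in PS model   | not in prognostic model
        (0, 1): 0.5,                  # exclude               | in prognostic model
        (1, 1): 0.5,                  # include               | in prognostic model
    }

probs = ps_inclusion_prior(omega=20)
```

Larger $\omega$ pushes the probability of including a covariate that does not predict the outcome toward zero, which is exactly what discourages the selection of instruments.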
16. First Stage: [Figure: the informative prior $p(\alpha^{x|c} \mid Y)$ over all models across $\omega = 1, 5, 20, 50$ and $100$.]
Our objective is $p(\alpha^{x|c} \mid Y)$, which is
$$p(\alpha^{x|c} \mid Y) = \sum_{\alpha^{y|c} \in \mathcal{M}_{y|c}} p(\alpha^{x|c} \mid \alpha^{y|c}) \, p(\alpha^{y|c} \mid Y)$$
17. Second Stage
BAC in Bayesian PS
It is a Bayesian PS stage with an informative prior $p(\alpha^{x|c} \mid Y)$.
PS model:
$$\mathrm{logit}(P[X_i \mid C_i]) = \sum_{k=1}^{p} \alpha_k^{x|c} \gamma_k C_{k,i}$$
Outcome model:
$$\mathrm{logit}(E[Y_i \mid X_i, C_i]) = \beta_0 + \beta_X X_i + \xi h(PS) + \sum_{k=1}^{p} \alpha_k^{x|c} \delta_k C_{k,i}$$
$\mathcal{M}_{PS} = \{128 \text{ models}\}$
18-19. The expected role of the informative prior
BAC in Bayesian PS
In the MCMC, we propose to move from model $\alpha^{x|c}$ (current) to $\alpha'^{x|c}$ (proposed):
⇒ adding one covariate ($\alpha_j^{x|c} = 0 \to \alpha_j'^{x|c} = 1$), or
⇒ deleting one covariate ($\alpha_j^{x|c} = 1 \to \alpha_j'^{x|c} = 0$).
Accept the proposed move with probability (for the adding case)
$$\min\left\{ \frac{L(\text{data} \mid \theta_{\alpha'^{x|c}}, \alpha'^{x|c}) \, p(\theta_{\alpha'^{x|c}} \mid \alpha'^{x|c})}{L(\text{data} \mid \theta_{\alpha^{x|c}}, \alpha^{x|c}) \, p(\theta_{\alpha^{x|c}} \mid \alpha^{x|c}) \, \varphi(u)} \cdot \frac{p(\alpha'^{x|c} \mid Y)}{p(\alpha^{x|c} \mid Y)}, \; 1 \right\}$$
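On the log scale this acceptance probability is straightforward to implement. A sketch (function and argument names are ours; in the actual sampler the likelihood and prior terms would come from the PS and outcome models):

```python
import math

def accept_prob_add(loglik_prop, loglik_curr,
                    logprior_theta_prop, logprior_theta_curr,
                    log_phi_u, logprior_model_prop, logprior_model_curr):
    """Acceptance probability for an 'add one covariate' move: the ratio
    inside min{., 1}, computed on the log scale for numerical stability.
    phi(u) is the proposal density of the newly generated parameter u."""
    log_ratio = ((loglik_prop + logprior_theta_prop + logprior_model_prop)
                 - (loglik_curr + logprior_theta_curr + logprior_model_curr
                    + log_phi_u))
    return math.exp(min(log_ratio, 0.0))  # equals min{ratio, 1}

# Illustrative (made-up) values: the proposed model fits noticeably better.
a = accept_prob_add(-120.0, -123.0, -2.1, -1.8, -0.9,
                    math.log(0.4), math.log(0.6))
```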
21-22. We propose to penalize the log-likelihood term of the model that contains one covariate more.
If the proposed model contains one covariate more ($\alpha_j^{x|c} = 0 \to \alpha_j'^{x|c} = 1$):
$$\Psi_{\alpha'^{x|c}} = -2 \times \log(N) \times \frac{p(\alpha^{x|c} \mid Y)}{p(\alpha'^{x|c} \mid Y)}$$
If the proposed model has one covariate less than the current model ($\alpha_j^{x|c} = 1 \to \alpha_j'^{x|c} = 0$):
$$\Psi_{\alpha^{x|c}} = -2 \times \log(N) \times \frac{p(\alpha'^{x|c} \mid Y)}{p(\alpha^{x|c} \mid Y)}$$
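In both cases the penalty attaches to whichever model (current or proposed) contains one covariate more, so one helper covers both expressions. A sketch (names are ours; $N$ is the sample size and the two prior arguments are $p(\alpha \mid Y)$ for the smaller and larger model):

```python
import math

def penalty_larger_model(N, prior_smaller, prior_larger):
    """Psi = -2 * log(N) * p(alpha_smaller | Y) / p(alpha_larger | Y),
    added to the log-likelihood term of whichever model (current or
    proposed) contains one covariate more."""
    return -2.0 * math.log(N) * prior_smaller / prior_larger

# Example: the informative prior favors the smaller model 2:1.
psi = penalty_larger_model(N=1000, prior_smaller=0.5, prior_larger=0.25)
```

The penalty is always negative and grows in magnitude when the informative prior disfavors the larger model (small $p(\alpha_{\text{larger}} \mid Y)$), which is how the prior gains enough influence to screen out instruments.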
23. [Figure: $p(\alpha_k^{x|c} \mid Y, X)$ when the acceptance probability includes the penalty term.]
24. [Figure: bias and MSE of estimates of the ACE across $\omega$'s, with and without the penalty, vs. the kitchen-sink and best-subset models.]
25-28. Conclusions
Novel approach: i) joining two methodologies, BAC and Bayesian PS; ii) an informative prior; and iii) applying RJMCMC.
The simulation study found that:
the informative prior alone was not able to shape the profiles of the models selected;
we have proposed to solve this by adding a penalty term so that the informative prior kicks in;
with the penalty term, the informative prior was able to influence the posterior inclusion probability (PIP) of the IVs without distorting the PIPs of all other variables.
Thanks to: Robert Platt (McGill U.); Francesca Dominici (Harvard U.); Matt Cefalu (RAND); Geneviève Lefebvre (UdeM); Jay Kaufman (McGill U.); Sahir Bhatnagar (McGill U.); Maxime Turgeon (McGill U.); CNODES and the Jewish General Hospital in Montreal.
29. Appendix. Simulation Exercise
$p = 7$ covariates ⇒ $128$ ($= 2^7$) models. Ignoring the model with no covariates, $|\mathcal{M}_{y|c}| = |\mathcal{M}_{PS}| = 2^7 - 1 = 127$.
Two scenarios: i) $N = 300$, and ii) $N = 1000$.
For all $i = 1, 2, \ldots, N$:
we simulate the $p$ covariates $C_i \sim \mathrm{MVN}(0, I)$;
the exposure variable $X_i$ is simulated from a Bernoulli distribution with probability given by
$$P(X_i \mid C_i) = \frac{\exp\left(\sum_{k=1}^{p} \gamma_k C_{k,i}\right)}{1 + \exp\left(\sum_{k=1}^{p} \gamma_k C_{k,i}\right)} \quad (1)$$
where we set $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_7) = (0.6, -0.6, 0.1, 0, 0.6, 0.1, 0)$;
the outcome variable $Y_i$ is generated similarly from a Bernoulli distribution with probability given by
$$P(Y_i \mid X_i, C_i) = \frac{\exp\left(\beta_0 + \beta X_i + \sum_{k=1}^{p} \phi_k C_{k,i}\right)}{1 + \exp\left(\beta_0 + \beta X_i + \sum_{k=1}^{p} \phi_k C_{k,i}\right)} \quad (2)$$
where we set $\phi = (\phi_1, \phi_2, \ldots, \phi_7) = (0.6, 0.1, -0.6, 0.6, 0, 0, 0)$, $\beta_0 = 0$ and $\beta = 0.1$.
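The data-generating mechanism in equations (1) and (2) is easy to reproduce. A sketch using NumPy (the seed is arbitrary; everything else follows the parameter values on this slide):

```python
import numpy as np

rng = np.random.default_rng(2016)
N, p = 1000, 7
gamma = np.array([0.6, -0.6, 0.1, 0.0, 0.6, 0.1, 0.0])  # exposure model
phi = np.array([0.6, 0.1, -0.6, 0.6, 0.0, 0.0, 0.0])    # outcome model
beta0, beta = 0.0, 0.1

def expit(z):
    return 1 / (1 + np.exp(-z))

C = rng.multivariate_normal(np.zeros(p), np.eye(p), size=N)  # C_i ~ MVN(0, I)
X = rng.binomial(1, expit(C @ gamma))                        # eq. (1)
Y = rng.binomial(1, expit(beta0 + beta * X + C @ phi))       # eq. (2)

# C1-C3 affect both X and Y (confounders), C4 only Y (risk factor),
# C5-C6 only X (instruments), C7 neither (noise) -- matching the slides.
```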
30. Thus, $\alpha^{x|c} = (1, 1, 1, 0, 0, 0, 0)$ is the model that contains the confounders necessary to satisfy the assumption of no unmeasured confounders. This model is called the minimal model, and we denote it $\alpha^{*x|c}$.
Lastly, this setting implies a true ACE equal to $0.06$ (calculated from a much larger sample using the true parameter values to compute $P(Y = 1 \mid X = 1) - P(Y = 1 \mid X = 0)$).
31. Appendix
Joint Bayesian PS estimation
The likelihood of the PS stage is given by:
$$L(X \mid \gamma, \bar\alpha^m, C) = \prod_{i=1}^{N} \left[ g_x^{-1}\!\left( \sum_{k=1}^{p} \bar\alpha_k^m \gamma_k C_{k,i} \right) \right]^{X_i} \left[ 1 - g_x^{-1}\!\left( \sum_{k=1}^{p} \bar\alpha_k^m \gamma_k C_{k,i} \right) \right]^{1 - X_i}$$
and the likelihood of the outcome stage is given by:
$$L(Y \mid \beta, \gamma, \delta, \xi, \bar\alpha^m, X, C) = \prod_{i=1}^{N} \left[ g_y^{-1}\!\left( \beta_0 + \beta_X X_i + \xi^T h(PS) + \sum_{k=1}^{p} \bar\alpha_k^m \delta_k C_{k,i} \right) \right]^{Y_i} \times \left[ 1 - g_y^{-1}\!\left( \beta_0 + \beta_X X_i + \xi^T h(PS) + \sum_{k=1}^{p} \bar\alpha_k^m \delta_k C_{k,i} \right) \right]^{1 - Y_i} \quad (3)$$
32. Joint Bayesian PS estimation with $\alpha$ unknown
Another consequence of adding $\alpha$ is that the ACE turns into a weighted average over different PS and outcome models, with weights corresponding to the posterior probability of each model.
Formally, let $\mathcal{M} = \{\alpha : \alpha \in \{0,1\}^p\}$ denote the set of all models being considered; its cardinality is $|\mathcal{M}| = 2^p$.
For instance, an element of $\mathcal{M}$ is the $m$-th model: $\alpha^m = (\alpha_1^m, \ldots, \alpha_p^m)$.
Let $p(\alpha^m)$ be the prior probability of the $m$-th model.
Then, the posterior probability of the $m$-th model is
$$p(\alpha^m \mid \text{data}) = \frac{p(\alpha^m) \, p(\text{data} \mid \alpha^m)}{\sum_{\alpha^i \in \mathcal{M}} p(\alpha^i) \, p(\text{data} \mid \alpha^i)}$$
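Normalizing prior times marginal likelihood over a model set is a one-liner best done in log space. An illustrative sketch (the marginal log-likelihood values below are made up):

```python
import numpy as np

def model_posteriors(log_marglik, prior):
    """p(alpha^m | data): normalize p(alpha^m) * p(data | alpha^m) over all
    models, working in log space to avoid underflow of tiny likelihoods."""
    log_w = np.log(prior) + log_marglik
    log_w -= log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Three hypothetical models with equal prior probability:
post = model_posteriors(np.array([-500.0, -502.0, -510.0]), np.full(3, 1 / 3))
```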
33. Joint Bayesian PS estimation with $\alpha$ unknown
Hence, the posterior distribution of the ACE will be a weighted average of the estimates of the ACE under each model in $\mathcal{M}$:
$$p(\mathrm{ACE} \mid \text{data}) \approx \sum_{\alpha^m \in \mathcal{M}} p(\mathrm{ACE}^m \mid \alpha^m, \text{data}) \, p(\alpha^m \mid \text{data})$$
where $\mathrm{ACE}^m = E_{C_{\alpha^m}}\{E[Y \mid X = 1, C_{\alpha^m}] - E[Y \mid X = 0, C_{\alpha^m}]\}$ and $C_{\alpha^m}$ denotes the subset of $C$ included in model $\alpha^m$.
Remark 5. $\mathrm{ACE}^m$ is an estimate of the causal effect if and only if $\alpha^m$ contains the confounders necessary to satisfy the assumption of no unmeasured confounders.
Remark 6. It assumes that each model has equal prior probability, that is, for all possible $\alpha$: $p(\alpha) = \frac{1}{|\mathcal{M}|}$.
34. First Stage. Posterior Distributions
Our objective is $p(\alpha^{x|c} \mid Y)$.
First, we need to compute $p(\alpha^{y|c} \mid Y)$:
$$p(\alpha^{y|c} \mid Y) \propto L(Y \mid \alpha^{y|c}) \, p(\alpha^{y|c}) \quad (4)$$
$L(Y \mid \alpha^{y|c})$ is the marginal likelihood under model $\alpha^{y|c}$ and is equal to:
$$L(Y \mid \alpha^{y|c}) = \int L(Y \mid \alpha^{y|c}, \eta) \, p(\eta \mid \alpha^{y|c}) \, d\eta \quad (5)$$
where $\eta$ is the vector of logistic-regression parameters in the prognostic score model under model $\alpha^{y|c}$, with dimension $\sum_{k=1}^{p} \alpha_k^{y|c}$;
$p(\eta \mid \alpha^{y|c})$ is the prior distribution of $\eta$ under model $\alpha^{y|c}$;
$L(Y \mid \alpha^{y|c}, \eta)$ is the likelihood for model $\alpha^{y|c}$, which involves only the prognostic score model.
35. First Stage. Posterior Distributions
$L(Y \mid \alpha^{y|c})$ is not analytically tractable, and thus we cannot apply MC3.
We sample from the joint posterior $p(\alpha^{y|c}, \eta \mid Y)$ by applying the RJMCMC algorithm.
Then we compute the informative prior as follows:
$$p(\alpha^{x|c} \mid Y) = \sum_{\alpha^{y|c} \in \mathcal{M}_{y|c}} p(\alpha^{x|c} \mid \alpha^{y|c}, Y) \, p(\alpha^{y|c} \mid Y) = \sum_{\alpha^{y|c} \in \mathcal{M}_{y|c}} p(\alpha^{x|c} \mid \alpha^{y|c}) \, p(\alpha^{y|c} \mid Y) \quad (6)$$
where the last equality assumes that $\alpha^{x|c} \perp Y \mid \alpha^{y|c}$.
36. First Stage. Posterior Distributions
$\omega = 1$ corresponds to $p(\alpha^{x|c} \mid Y)$ being an uninformative prior. Why?
$$\omega = 1 \Rightarrow p(\alpha_k^{x|c} = 0 \mid \alpha_k^{y|c} = 0) = p(\alpha_k^{x|c} = 1 \mid \alpha_k^{y|c} = 0) = p(\alpha_k^{x|c} = 0 \mid \alpha_k^{y|c} = 1) = p(\alpha_k^{x|c} = 1 \mid \alpha_k^{y|c} = 1) = \tfrac{1}{2} \quad \forall k$$
So we have that
$$p(\alpha^{x|c} \mid \alpha^{y|c}) = \prod_{k=1}^{p} p(\alpha_k^{x|c} \mid \alpha_k^{y|c}) = \frac{1}{2^p}$$
where $p$ is the number of covariates.
Hence,
$$p(\alpha^{x|c} \mid Y) = \frac{1}{2^p} \sum_{\alpha^{y|c} \in \mathcal{M}_{y|c}} p(\alpha^{y|c} \mid Y) = \frac{1}{2^p} \quad (7)$$
since $\sum_{\alpha^{y|c} \in \mathcal{M}_{y|c}} p(\alpha^{y|c} \mid Y) = 1$.
Thus, $p(\alpha^{x|c} \mid Y)$ carries no outcome information into the second stage.
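The mixture formula (6) combined with the slide-15 conditionals can be checked numerically: with $\omega = 1$ the informative prior collapses to the uniform $1/2^p$ regardless of the prognostic-score posterior. A small sketch with an arbitrary (made-up) posterior $p(\alpha^{y|c} \mid Y)$:

```python
import itertools
import numpy as np

p = 3  # few covariates, so all 2^p models can be enumerated

def cond_prob(x_ind, y_ind, omega):
    """p(alpha_k^{x|c} = x_ind | alpha_k^{y|c} = y_ind), from slide 15."""
    if y_ind == 1:
        return 0.5
    return omega / (1 + omega) if x_ind == 0 else 1 / (1 + omega)

def informative_prior(alpha_x, post_y, omega):
    """Equation (6): sum over alpha^{y|c} of p(alpha^{x|c} | alpha^{y|c})
    times p(alpha^{y|c} | Y), with the product form over k."""
    return sum(w * np.prod([cond_prob(alpha_x[k], alpha_y[k], omega)
                            for k in range(p)])
               for alpha_y, w in post_y.items())

models = list(itertools.product([0, 1], repeat=p))
weights = np.arange(1, len(models) + 1, dtype=float)
post_y = dict(zip(models, weights / weights.sum()))  # arbitrary posterior

uniform = [informative_prior(ax, post_y, omega=1) for ax in models]
```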
37. Second Stage.
It is a Bayesian PS estimation stage, based only on the PS and outcome models.
We incorporate an informative prior for the model indicator of the PS model, $p(\alpha^{x|c} \mid Y)$, which is inherited from the first stage.
The setting of Zigler and Dominici 2014 is a particular case of ours, with $\omega = 1 \Rightarrow p(\alpha^{x|c} \mid Y) = p(\alpha^{x|c}) = \frac{1}{2^p}$.
The main goal of this stage is to estimate the average causal effect (ACE) of treatment, $X = 1$ vs. $X = 0$.
The posterior distribution of the ACE:
$$p(\mathrm{ACE} \mid \text{data}) \approx \sum_{\alpha^{x|c} \in \mathcal{M}_{PS}} p(\mathrm{ACE}_{\alpha^{x|c}} \mid \alpha^{x|c}, \text{data}) \, p(\alpha^{x|c} \mid \text{data}) \quad (8)$$
where, as before, $\mathrm{ACE}_{\alpha^{x|c}} = E_{C_{\alpha^{x|c}}}\{E[Y \mid X = 1, C_{\alpha^{x|c}}] - E[Y \mid X = 0, C_{\alpha^{x|c}}]\}$ and $C_{\alpha^{x|c}}$ denotes the subset of $C$ included in model $\alpha^{x|c}$.
38. Second Stage.
Prior Distributions
Similarly to Zigler and Dominici 2014, we use a flat prior distribution on $(\beta_0, \beta_X, \xi, \delta_{\alpha^{x|c}}, \gamma_{\alpha^{x|c}})$.
In contrast to the previous approach, here $p(\alpha^{x|c} \mid Y)$ is an informative prior.
Posterior Distributions
We sample from the joint posterior $p(\alpha^{x|c}, \beta_0, \beta_X, \xi, \delta_{\alpha^{x|c}}, \gamma_{\alpha^{x|c}} \mid \text{data})$ by applying RJMCMC. This joint posterior is given by:
$$p(\alpha^{x|c}, \beta_0, \beta_X, \xi, \delta_{\alpha^{x|c}}, \gamma_{\alpha^{x|c}} \mid \text{data}) \propto L(Y, X \mid \alpha^{x|c}, \theta_{\alpha^{x|c}}, C) \, p(\theta_{\alpha^{x|c}} \mid \alpha^{x|c}) \, p(\alpha^{x|c} \mid Y)$$
where $\theta_{\alpha^{x|c}} = (\beta_0, \beta_X, \xi, \delta_{\alpha^{x|c}}, \gamma_{\alpha^{x|c}})$, $L(Y, X \mid \alpha^{x|c}, \theta_{\alpha^{x|c}}, C)$ is the joint likelihood of the PS and outcome models, and $p(\theta_{\alpha^{x|c}} \mid \alpha^{x|c})$ and $p(\alpha^{x|c} \mid Y)$ are the prior distributions.
39. The marginal likelihood under model $\alpha^{x|c}$,
$$L(Y, X \mid \alpha^{x|c}, C) = \int L(Y, X \mid \alpha^{x|c}, \theta_{\alpha^{x|c}}, C) \, p(\theta_{\alpha^{x|c}} \mid \alpha^{x|c}) \, d\theta_{\alpha^{x|c}},$$
will not have an analytically tractable expression that could be used to compute $p(\alpha^{x|c} \mid Y, X)$, the quantity needed to apply MC3.
40. RJMCMC
RJMCMC was proposed by Green 1995 as an extension of the Metropolis-Hastings algorithm. It creates a reversible Markov chain that can "jump" between models with parameter spaces of different dimensions (trans-dimensional Markov chains), retaining the detailed balance condition that guarantees the correct limiting distribution.
The standard Metropolis-Hastings within Gibbs sampling algorithm cannot be applied: when we condition on one model, say $\alpha^{x|c}$, then $(\beta_0, \beta_X, \xi, \delta_{\alpha^{x|c}}, \gamma_{\alpha^{x|c}}) \in \Theta_{\alpha^{x|c}}$; but when we condition on $(\beta_0, \beta_X, \xi, \delta_{\alpha^{x|c}}, \gamma_{\alpha^{x|c}})$, then $\alpha^{x|c}$ cannot move, and we cannot move between models.
We need to complete the spaces, or supplement each of them with an artificial space, in order to make them compatible. In other words, we need to create a bijection between them.
41. Outline of RJMCMC
Step 1) Update the parameters that are in the current model, for example using the Metropolis-Hastings algorithm.
Step 2.a) Generate a proposed variable $j \in \{1, 2, \ldots, p\}$ to add to or delete from the model, with probability $1/p$. Thus, we propose to change $\alpha$ to $\alpha'$ where $\alpha_j' = 1 - \alpha_j$.
Step 2.b) If $\alpha_j = 0 \to \alpha_j' = 1$ (include covariate $j$ in the model):
i) Generate the additional parameter $u$ corresponding to variable $j$ from a proposal density $u \sim \varphi(u)$.
ii) Set $\theta_{\alpha'} = (\theta_{\alpha,(-j)}, u_{\alpha,(j)})$.
iii) Accept the proposed move with probability
$$\Delta\{(\alpha, \theta_\alpha) \to (\alpha', \theta_{\alpha'})\} = \min\left\{ \frac{L(\text{data} \mid \theta_{\alpha'}, \alpha') \, p(\theta_{\alpha'} \mid \alpha') \, p(\alpha')}{L(\text{data} \mid \theta_\alpha, \alpha) \, p(\theta_\alpha \mid \alpha) \, p(\alpha) \, \varphi(u)}, \; 1 \right\}$$
42. iv) If the proposed move is accepted, update $\alpha$ and $\theta_\alpha$ to $\alpha'$ and $\theta_{\alpha'}$. Otherwise, leave $\alpha$ and $\theta_\alpha$ unchanged.
Step 2.c) If $\alpha_j = 1 \to \alpha_j' = 0$ (exclude covariate $j$ from the model):
i) Set $\theta_{\alpha'} = \theta_{\alpha,(-j)}$, i.e., drop the component of $\theta_\alpha$ corresponding to $j$.
ii) Accept the proposed move with probability
$$\Delta\{(\alpha, \theta_\alpha) \to (\alpha', \theta_{\alpha'})\} = \min\left\{ \frac{L(\text{data} \mid \theta_{\alpha'}, \alpha') \, p(\theta_{\alpha'} \mid \alpha') \, p(\alpha') \, \varphi(u)}{L(\text{data} \mid \theta_\alpha, \alpha) \, p(\theta_\alpha \mid \alpha) \, p(\alpha)}, \; 1 \right\}$$
iii) If the proposed move is accepted, update $\alpha$ and $\theta_\alpha$ to $\alpha'$ and $\theta_{\alpha'}$. Otherwise, leave $\alpha$ and $\theta_\alpha$ unchanged.
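To make the outline concrete, here is a self-contained toy RJMCMC for variable selection in a Gaussian linear model. This is not the talk's PS/outcome sampler: the likelihood, the flat model prior, the N(0,1) priors on included coefficients, and $\varphi(u) = N(0,1)$ are all choices made here to keep the sketch short. Step 1 (the within-model parameter refresh) is omitted; included coefficients only change via delete-then-add moves, which affects mixing but not validity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: only the first two of four covariates matter.
N, p = 200, 4
C = rng.normal(size=(N, p))
y = C @ np.array([1.0, -1.0, 0.0, 0.0]) + rng.normal(size=N)

def norm_logpdf(x, mu=0.0):          # standard-normal log density
    return -0.5 * np.log(2 * np.pi) - 0.5 * (np.asarray(x) - mu) ** 2

def loglik(theta, alpha):            # Gaussian likelihood with unit noise
    return norm_logpdf(y, C[:, alpha == 1] @ theta).sum()

alpha = np.zeros(p, dtype=int)       # start from the empty model
theta = np.zeros(0)
visits = np.zeros(p)
iters = 2000

for _ in range(iters):
    j = rng.integers(p)              # Step 2.a: covariate to flip
    alpha_p = alpha.copy()
    alpha_p[j] = 1 - alpha[j]
    if alpha[j] == 0:                # add j: draw u ~ phi(u) = N(0,1)
        u = rng.normal()
        theta_p = np.insert(theta, int(np.sum(alpha_p[:j])), u)
        log_r = (loglik(theta_p, alpha_p) + norm_logpdf(theta_p).sum()
                 - loglik(theta, alpha) - norm_logpdf(theta).sum()
                 - norm_logpdf(u))   # phi(u) in the denominator
    else:                            # delete j: phi(u) enters the numerator
        pos = int(np.sum(alpha[:j]))
        u = theta[pos]
        theta_p = np.delete(theta, pos)
        log_r = (loglik(theta_p, alpha_p) + norm_logpdf(theta_p).sum()
                 + norm_logpdf(u)
                 - loglik(theta, alpha) - norm_logpdf(theta).sum())
    if np.log(rng.uniform()) < log_r:
        alpha, theta = alpha_p, theta_p
    visits += alpha                  # running inclusion counts

pip = visits / iters                 # posterior inclusion probabilities
```

Because the model prior is flat here, the ratio $p(\alpha')/p(\alpha)$ cancels out of `log_r`, matching the remark on slide 43; with an informative prior $p(\alpha^{x|c} \mid Y)$ it would survive as an extra term.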
43. The difference between the first and second stages, in terms of the RJMCMC algorithm, lies mainly in:
The likelihood $L(\text{data} \mid \theta_\alpha, \alpha)$. The likelihood of the first stage, $L(\text{data} \mid \theta_{\alpha^{y|c}}, \alpha^{y|c})$, is based only on the prognostic score model; the likelihood in the second stage, $L(\text{data} \mid \theta_{\alpha^{x|c}}, \alpha^{x|c})$, is based jointly on the PS and outcome models.
The ratio $\frac{p(\alpha')}{p(\alpha)}$ in the acceptance probability of the proposed move. In the first stage, and in the second stage with $\omega = 1$, this ratio cancels out since each model has equal prior probability. In the second stage with $\omega > 1$, however, $p(\alpha'^{x|c} \mid Y)$ is an informative prior, so the ratio $\frac{p(\alpha'^{x|c} \mid Y)}{p(\alpha^{x|c} \mid Y)}$ does not cancel.