• Treatment regimes for multiple decision points (definition, recursive representation)
• Statistical framework (potential outcomes, feasible sets of treatments and classes of regimes, value, data and data sources, identifiability assumptions)
• The g-computation algorithm
• Estimation of the value of a fixed regime (via g-computation, IPW/AIPW estimators, marginal structural models)
2019 PMED Spring Course - Multiple Decision Treatment Regimes: Fundamentals - Marie Davidian, February 13, 2019
1. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
216
2. Recall: Acute Leukemia
• At baseline: Information x1, history h1 = x1 ∈ H1
• Decision 1: Set of options A1 = {C1,C2}; rule 1: d1(h1): H1 → A1
• Between Decisions 1 and 2: Collect additional information x2, including
responder status
• Accrued information/history h2 = (x1, therapy at decision 1, x2) ∈ H2
• Decision 2: Set of options A2 = {M1,M2,S1,S2}; rule 2:
d2(h2): H2 → {M1,M2} (responder), d2(h2): H2 → {S1,S2} (nonresponder)
• Treatment regime : d = {d1(h1), d2(h2)} = (d1, d2)
217
3. K decision treatment regime
In general: K decision points/stages
• Baseline information x1 ∈ X1, intermediate information xk ∈ Xk
between Decisions k − 1 and k, k = 2, . . . , K
• Treatment options Ak at Decision k, elements ak ∈ Ak ,
k = 1, . . . , K
• Accrued information or history
h1 = x1 ∈ H1
hk = (x1, a1, . . . , xk−1, ak−1, xk ) ∈ Hk , k = 2, . . . , K,
• Decision rules d1(h1), d2(h2), . . . , dK (hK ), dk : Hk → Ak
• Treatment regime
d = {d1(h1), . . . , dK (hK )} = (d1, d2, . . . , dK )
218
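The general structure above, a regime as a bundle of K rules dk: Hk → Ak, can be sketched in code. This is a minimal illustration (the helper `make_regime` and the particular rules are invented, not from the slides), echoing the K = 2 acute leukemia setting:

```python
# A minimal sketch: a K-decision treatment regime as a list of decision
# rules, each mapping an accrued history h_k to a treatment option.
# All names below (make_regime, the example rules) are illustrative.

def make_regime(rules):
    """Bundle K decision rules d_1, ..., d_K into a regime d."""
    return list(rules)

# K = 2 example echoing acute leukemia: h1 = x1, and h2 = (x1, a1, x2)
# where x2 is responder status (1 = responder)
d1 = lambda h1: "C1" if h1 >= 0.5 else "C2"   # rule at Decision 1
d2 = lambda h2: "M1" if h2[2] == 1 else "S1"  # rule at Decision 2

d = make_regime([d1, d2])
a1 = d[0](0.7)            # option selected at Decision 1
a2 = d[1]((0.7, a1, 1))   # option selected at Decision 2 for a responder
print(a1, a2)             # C1 M1
```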
4. Notation
Overbar notation: With
xk ∈ Xk , ak ∈ Ak , k = 1, . . . , K
it is convenient to define for k = 1, . . . , K
x̄k = (x1, . . . , xk ), āk = (a1, . . . , ak )
X̄k = X1 × · · · × Xk , Āk = A1 × · · · × Ak
• Conventions: ā = āK , x̄ = x̄K , a0 is null
• h1 = x1, hk = (x̄k , āk−1), k = 2, . . . , K
• H1 = X1, Hk = X̄k × Āk−1, k = 2, . . . , K
• d̄k = (d1, . . . , dk ), k = 1, . . . , K, d = d̄K = (d1, . . . , dK )
219
5. Decision points
Decision points depend on the context: For example
• Acute leukemia: Decisions are at milestones in the disease
progression (diagnosis, evaluation of response)
• HIV infection: Decisions are according to a schedule (at monthly
clinic visits)
• Cardiovascular disease (CVD): Decisions are made upon
occurrence of an event (myocardial infarction)
Perspective:
• Timing of decisions can be fixed (HIV) or random (leukemia)
• If any individual would reach all K decision points, the distinction
is not important; we assume this going forward
• If different individuals can experience different numbers of
decisions (CVD), the number of decisions reached is random,
and a more specialized framework is required
• The latter also is the case if the outcome is a time to an event
220
7. Recursive representation of rules
Fact: If the K rules in d ∈ D are followed by an individual, the options
selected at each decision depend only on the evolving x1, . . . , xK
• Decision 1: Option selected is d1(h1) = d1(x1)
• Between Decisions 1 and 2: x2
• Decision 2: Rule d2(h2) = d2(x̄2, a1), option selected depends
on the option selected at Decision 1
d2{x̄2, d1(x1)}
• Between Decisions 2 and 3: x3
• Decision 3: Rule d3(h3) = d3(x̄3, ā2) = d3(x̄3, a1, a2), option
selected depends on those at Decisions 1 and 2
d3[x̄3, d1(x1), d2{x̄2, d1(x1)}]
• And so on. . .
222
8. Recursive representation of rules
Concise representation: For k = 2, . . . , K
d̄2(x̄2) = [d1(x1), d2{x̄2, d1(x1)}]
d̄3(x̄3) = [d1(x1), d2{x̄2, d1(x1)}, d3{x̄3, d̄2(x̄2)}]
... (5.2)
d̄K (x̄K ) = [d1(x1), d2{x̄2, d1(x1)}, . . . , dK {x̄K , d̄K−1(x̄K−1)}]
• d̄k (x̄k ) comprises the options selected through Decision k
• d(x̄) = d̄K (x̄K )
• This representation will be useful later
223
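The recursive representation (5.2) can be sketched in code: the options a regime selects unfold along with the covariates, each rule seeing the history built from earlier covariates and earlier selections. The function name and example rules are illustrative:

```python
# A sketch of the recursive representation (5.2): given rules d_k(h_k) with
# h_k = (x_1, a_1, ..., x_k), compute the options a regime selects as the
# covariates x_1, ..., x_K unfold. Names here are illustrative.

def options_selected(rules, xs):
    """Return (a_1, ..., a_K) chosen by following the rules along xs."""
    history = []        # accrues (x_1, a_1, x_2, a_2, ...)
    selected = []
    for d_k, x_k in zip(rules, xs):
        history.append(x_k)
        a_k = d_k(tuple(history))   # option depends only on evolving history
        history.append(a_k)
        selected.append(a_k)
    return tuple(selected)

# K = 2: d1 treats if x1 = 1; d2 treats if either x2 = 1 or a1 = 1
d1 = lambda h: int(h[0] == 1)
d2 = lambda h: int(h[2] == 1 or h[1] == 1)
print(options_selected([d1, d2], [1, 0]))  # (1, 1): a2 reflects a1 = d1(x1)
```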
9. Redundancy
From the perspective of an individual following the K rules in d:
• The definition of rules dk (hk ) = dk (x̄k , āk−1) is redundant
• a1 is determined by x1, a2 is determined by x̄2 = (x1, x2), . . .
• But definition of dk as a function of hk = (x̄k , āk−1) is useful for
characterizing and estimating an optimal regime later
Illustration of redundancy: K = 2, A1 = {0, 1}, A2 = {0, 1},
X1 = {0, 1}, X2 = {0, 1} (x1 and x2 are binary)
• d = (d1, d2) ∈ D, d1 : H1 = X1 → A1, d2 : H2 = X2 × A1 → A2
• For each value of h1 = x1, d1 must return a value in A1, e.g.,
d1(x1 = 0) = 0, d1(x1 = 1) = 1
224
10. Redundancy
Illustration of redundancy, continued:
• For each of the 2³ = 8 possible values of h2 = (x1, x2, a1), d2
must return a value in A2, e.g.,
d2(x1 = 0, x2 = 0, a1 = 0) = 0
d2(x1 = 0, x2 = 0, a1 = 1) = 1∗
d2(x1 = 0, x2 = 1, a1 = 0) = 1
d2(x1 = 0, x2 = 1, a1 = 1) = 1∗
d2(x1 = 1, x2 = 0, a1 = 0) = 0∗
d2(x1 = 1, x2 = 0, a1 = 1) = 1
d2(x1 = 1, x2 = 1, a1 = 0) = 0∗
d2(x1 = 1, x2 = 1, a1 = 1) = 0
• Configurations with ∗ could never occur if an individual followed
regime d
225
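The starred configurations above can be found mechanically: under the example rule d1(x1) = x1, a history (x1, x2, a1) can occur only if a1 = d1(x1). A short sketch (illustrative code only):

```python
# A sketch of the redundancy illustration: enumerate all 2^3 = 8 values of
# h2 = (x1, x2, a1) and keep those that could occur under the regime with
# d1(x1) = x1, i.e., d1(0) = 0 and d1(1) = 1. Illustrative code only.
from itertools import product

d1 = lambda x1: x1                      # the example rule at Decision 1
reachable = [(x1, x2, a1)
             for x1, x2, a1 in product([0, 1], repeat=3)
             if a1 == d1(x1)]           # a1 is determined by x1 under d

print(len(reachable))   # 4 of the 8 configurations can occur
```

The other four configurations are exactly the starred rows on the slide.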
11. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
226
12. Outcome of interest
Determination of outcome:
• Can be ascertained after Decision K, e.g., for HIV infected
patients, outcome = viral load (viral RNA copies/mL) measured
at a final clinic visit after Decision K
• Can be defined using intervening information, e.g., for HIV
infected patients with CD4 count (cells/mm³) measured at each
clinic visit (Decisions k = 2, . . . , K) and at a final clinic visit after
Decision K
outcome = total # CD4 counts > 200 cells/mm³
Convention: As for single decision case, assume larger outcomes
are preferred
• E.g., because smaller viral load is better, take
outcome = −viral load
227
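Both outcome constructions above are one-liners; a sketch with invented toy data:

```python
# A sketch of outcomes defined from intervening information: for the HIV
# example, the outcome counts CD4 measurements exceeding 200 cells/mm^3
# across visits; when smaller is better (viral load), negate. Toy data.
cd4 = [150, 220, 310, 180, 450]          # CD4 counts at the visits (invented)
outcome_cd4 = sum(c > 200 for c in cd4)  # total # CD4 counts > 200
viral_load = 1_200.0
outcome_vl = -viral_load                 # larger outcome preferred
print(outcome_cd4, outcome_vl)           # 3 -1200.0
```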
13. Potential outcomes for K decisions
Intuitively: For a randomly chosen individual with history X1
• If he/she were to receive a1 ∈ A1 at Decision 1, the evolution of
his/her disease/disorder process after Decision 1 would be
influenced by a1
• Suggests: X2*(a1) = intervening information that would arise
between Decisions 1 and 2 after receiving a1 ∈ A1 at Decision 1
• If he/she then were to receive a2 ∈ A2 at Decision 2, the
evolution of his/her disease/disorder process after Decision 2
would be influenced by a1 followed by a2
• Suggests: X3*(ā2) = intervening information that would arise
between Decisions 2 and 3 after receiving a1 ∈ A1 at Decision 1
and a2 at Decision 2
• And so on. . .
228
14. Potential outcomes for K decisions
Ultimately: If he/she were to receive options ā = āK = (a1, . . . , aK )
at Decisions 1, . . . , K
Y*(āK ) = Y*(ā) = outcome that would be achieved
Summarizing: Potential information at each decision and potential
outcome if an individual were to receive ā = (a1, . . . , aK )
{X1, X2*(a1), X3*(ā2), . . . , XK*(āK−1), Y*(ā)}
• X1 may or may not be included
• Set of all possible potential outcomes for all ā ∈ Ā
W* = {X2*(a1), X3*(ā2), . . . , XK*(āK−1), Y*(ā),
for a1 ∈ A1, ā2 ∈ Ā2, . . . , āK−1 ∈ ĀK−1, ā ∈ Ā} (5.3)
229
15. Feasible sets and classes of treatment regimes
Example: Acute leukemia, K = 2
A1 = {C1,C2}, A2 = {M1, M2, S1, S2}
• At Decision 1, both options are feasible for all patients
• At Decision 2, only maintenance options are feasible for patients
who respond, and only salvage options are feasible for patients
who do not respond
• This is almost always the case in multiple decision problems at
decision points other than Decision 1
• Of course, it is also possible at Decision 1
230
16. Feasible sets and classes of treatment regimes
Formal specification: For any history hk = (x̄k , āk−1) ∈ Hk at
Decision k, k = 1, . . . , K
• The set of feasible treatment options at Decision k is
Ψk (hk ) = Ψk (x̄k , āk−1) ⊆ Ak , k = 1, . . . , K (5.4)
Ψ1(h1) = Ψ1(x1) ⊆ A1 (a0 null)
• Ψk (hk ) is a function mapping Hk to the set of all possible
subsets of Ak
• Ψk (hk ) can be a strict subset of Ak or all of Ak , depending on hk
• Collectively: Ψ = (Ψ1, . . . , ΨK )
231
17. Feasible sets and classes of treatment regimes
Example, revisited: Acute leukemia, K = 2
A1 = {C1,C2}, A2 = {M1, M2, S1, S2}
• Suppose C1 and C2 are feasible for all individuals regardless of
h1
Ψ1(h1) = A1 for all h1
• If h2 indicates response
Ψ2(h2) = {M1, M2} ⊂ A2 for all such h2
• If h2 indicates nonresponse
Ψ2(h2) = {S1, S2} ⊂ A2 for all such h2
• More complex specifications of feasible sets that take account of
additional information are of course possible
232
18. Feasible sets and classes of treatment regimes
Fancier example: Acute leukemia, K = 2
A1 = {C1,C2}, A2 = {M1, M2, S1, S2}
• If C1 is contraindicated for patients with renal impairment
Ψ1(h1) = {C2} if h1 indicates renal impairment
= {C1,C2} = A1 if h1 does not
• If S1 increases risk of adverse events in nonresponders with low
WBC count WBC2
Ψ2(h2) = {S2} if h2 indicates nonresponse, WBC2 ≤ w
= {S1, S2} if h2 indicates nonresponse, WBC2 > w
for some threshold w (×10³/µL)
233
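The two leukemia examples above can be sketched as feasible-set functions Ψk(hk); here histories are dicts and the field names ("renal_impairment", "responder") are illustrative assumptions:

```python
# A minimal sketch of feasible-set functions Psi_k(h_k): induction options
# restricted by renal impairment at Decision 1, and maintenance vs. salvage
# options by responder status at Decision 2. Field names are illustrative.

def psi1(h1):
    """Feasible induction options given baseline history h1 (a dict)."""
    return {"C2"} if h1["renal_impairment"] else {"C1", "C2"}

def psi2(h2):
    """Feasible second-stage options given accrued history h2 (a dict)."""
    return {"M1", "M2"} if h2["responder"] else {"S1", "S2"}

# A Psi-specific regime's rule at Decision k must select from psi_k(h_k)
print(psi1({"renal_impairment": True}))   # {'C2'}
print(psi2({"responder": False}))         # the salvage options
```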
19. Feasible sets and classes of treatment regimes
Ideally:
• Specification of the feasible sets is dictated by scientific
considerations
• Disease/disorder context, available treatment options, patient
population, etc
• Specification of Ψk (hk ), k = 1, . . . , K, should incorporate only
information in hk that is critical to treatment selection
Regimes and feasible sets: Given specified feasible sets Ψ
• A regime d = (d1, . . . , dK ) whose rules select treatment options
for history hk from those in Ψk (hk ) is defined in terms of Ψ
• Thus, regimes are Ψ-specific, and the relevant class of all
possible regimes D depends on Ψ (suppressed in the notation)
234
20. Feasible sets and classes of treatment regimes
In practice: At Decision k, there is a small number mk of subsets
Ak,l ⊆ Ak , l = 1, . . . , mk , that are feasible sets for all hk
• E.g., for acute leukemia, m2 = 2
A2,1 = {M1, M2}, A2,2 = {S1, S2}
• If r2 is the component of h2 indicating response
Ψ2(h2) = A2,1 for h2 with r2 = 1
= A2,2 for h2 with r2 = 0
Decision rules: Different rule for each subset
• E.g., for acute leukemia, as in (5.1)
d2(h2) = I(r2 = 1) d2,1(h2) + I(r2 = 0) d2,2(h2)
d2,1(h2) is a rule selecting maintenance therapy for responders
d2,2(h2) is a rule selecting salvage therapy for nonresponders
235
21. Feasible sets and classes of treatment regimes
In general: With mk distinct subsets Ak,l ⊆ Ak , l = 1, . . . , mk , as
feasible sets
• Define sk (hk ) ∈ {1, . . . , mk } according to which of these subsets
Ψk (hk ) corresponds to for given hk
• dk,l(hk ) is the rule corresponding to the lth subset Ak,l
• Decision rule at Decision k has form
dk (hk ) = Σ_{l=1}^{mk} I{sk (hk ) = l} dk,l(hk ) (5.5)
• Henceforth, it is understood that dk (hk ) may be expressed as in
(5.5) where appropriate
236
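The composite rule (5.5) can be sketched as a dispatcher: sk(hk) indicates the feasible subset, and only the corresponding subset-specific rule is applied. Function names and the toy rules are illustrative:

```python
# A sketch of the composite rule (5.5): with subsets A_{k,l}, the rule at
# Decision k dispatches to a subset-specific rule d_{k,l} according to
# s_k(h_k). Names and the toy rules below are illustrative.

def make_dk(s_k, subrules):
    """Build d_k(h_k) = sum_l I{s_k(h_k) = l} d_{k,l}(h_k)."""
    def d_k(h_k):
        return subrules[s_k(h_k)](h_k)   # only the indicated term is nonzero
    return d_k

# Acute leukemia, Decision 2: s_2 indexes {M1, M2} for responders (l = 1)
# and {S1, S2} for nonresponders (l = 2); h2 = (x1, a1, responder, wbc)
s2 = lambda h2: 1 if h2[2] == 1 else 2
d21 = lambda h2: "M1" if h2[3] > 3.0 else "M2"   # maintenance rule
d22 = lambda h2: "S1" if h2[3] > 3.0 else "S2"   # salvage rule

d2 = make_dk(s2, {1: d21, 2: d22})
print(d2((0, "C1", 1, 5.0)))   # M1
print(d2((0, "C2", 0, 2.0)))   # S2
```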
22. Feasible sets and classes of treatment regimes
Note: For any Ψ-specific regime d = (d1, . . . , dK )
• At Decision k, dk (hk ) = dk (x̄k , āk−1) returns only options in
Ψk (hk ) = Ψk (x̄k , āk−1), i.e.,
dk (hk ) = dk (x̄k , āk−1) ∈ Ψk (hk ) ⊆ Ak
• Thus, dk need map only a subset of Hk = X̄k × Āk−1 to Ak
• We discuss this more shortly
237
23. Potential outcomes for a fixed regime d ∈ D
Intuitively: If a randomly chosen individual with history X1 were to
receive treatment options by following the rules in d
• Decision 1: Treatment determined by d1
• X2*(d1) = intervening information that would arise between
Decisions 1 and 2
• Decision 2: Treatment determined by d2
• X3*(d̄2) = intervening information that would arise between
Decisions 2 and 3
...
• Xk*(d̄k−1) = intervening information that would arise between
Decisions k − 1 and k, k = 2, . . . , K
• Y*(d) = Y*(d̄K ) = outcome that would be achieved if all rules in
d were followed
Potential outcomes under regime d:
{X1, X2*(d1), X3*(d̄2), . . . , XK*(d̄K−1), Y*(d)} (5.6)
238
24. Potential outcomes for a fixed regime d ∈ D
Formally: These potential outcomes are functions of W* in (5.3)
• Define
X̄k*(āk−1) = {X1, X2*(a1), X3*(ā2), . . . , Xk*(āk−1)}, k = 2, . . . , K
• Then
X2*(d1) = Σ_{a1∈A1} X2*(a1) I{d1(X1) = a1}
Xk*(d̄k−1) = Σ_{āk−1∈Āk−1} Xk*(āk−1) Π_{j=1}^{k−1} I[ dj {X̄j*(āj−1), āj−1} = aj ] (5.7)
k = 3, . . . , K
Y*(d) = Σ_{ā∈Ā} Y*(ā) Π_{j=1}^{K} I[ dj {X̄j*(āj−1), āj−1} = aj ]
• Also define
X̄k*(d̄k−1) = {X1, X2*(d1), X3*(d̄2), . . . , Xk*(d̄k−1)}, k = 2, . . . , K
239
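The sum-of-indicators construction in (5.7) can be sketched for K = 2 with binary options: given an individual's full table of potential outcomes (values invented for illustration), Y*(d) picks out the single outcome consistent with following d:

```python
# A sketch of (5.7) for K = 2 with binary options: Y*(d) as a sum over
# (a1, a2) of Y*(a1, a2) times indicators that d selects those options.
# The potential-outcome values below are invented for illustration.
from itertools import product

x1 = 1
x2_star = {0: 0, 1: 1}                  # X2*(a1) for a1 = 0, 1 (assumed)
y_star = {(0, 0): 5.0, (0, 1): 6.0,     # Y*(a1, a2) (assumed values)
          (1, 0): 4.0, (1, 1): 8.0}

d1 = lambda x1: x1                       # treat at Decision 1 if x1 = 1
d2 = lambda x2, a1: max(x2, a1)          # rule at Decision 2

# Y*(d) = sum over (a1,a2) of Y*(a1,a2) I{d1(X1)=a1} I{d2(X2*(a1),a1)=a2}
y_d = sum(y_star[a1, a2]
          * (d1(x1) == a1)
          * (d2(x2_star[a1], a1) == a2)
          for a1, a2 in product([0, 1], repeat=2))
print(y_d)   # 8.0: d selects a1 = 1, then a2 = 1 since X2*(1) = 1
```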
25. Value of a K-decision regime
With these definitions: The value of d ∈ D is
V(d) = E{Y*(d)}
• And an optimal regime dopt ∈ D satisfies
E{Y*(dopt)} ≥ E{Y*(d)} for all d ∈ D (5.8)
equivalently
dopt = arg max_{d∈D} E{Y*(d)} = arg max_{d∈D} V(d)
240
26. Optimal treatment options vs. optimal decisions
For a randomly chosen individual with history H1:
• The optimal sequence of treatment options for this individual is
arg max_{ā∈Ā} Y*(ā)
which of course is not knowable in practice
• All that is known at baseline is H1, and d1opt, . . . , dKopt select the
(feasible) options corresponding to the largest expected outcome
• From the definition of Y*(d) in (5.7),
Y*(d) ≤ max_{ā∈Ā} Y*(ā) for all d ∈ D =⇒ Y*(dopt) ≤ max_{ā∈Ā} Y*(ā)
so an optimal regime might not select optimal options for this
individual at each decision
• Rather, dopt leads to the optimal sequence of decisions that can
be made based on the available information on this individual
241
27. Data
Ideally: For i = 1, . . . , n, i.i.d.
(X1i, A1i, X2i, A2i, . . . , XKi, AKi, Yi) = (X̄Ki, ĀKi, Yi) = (X̄i, Āi, Yi) (5.9)
• X1 = baseline information at Decision 1, taking values in X1
• Ak = treatment option actually received at Decision k,
k = 1, . . . , K, taking values in Ak
• Xk = intervening information between Decisions k − 1 and k,
k = 2, . . . , K, taking values in Xk
• X̄k = (X1, . . . , Xk ), X̄ = X̄K = (X1, . . . , XK ), and
Āk = (A1, . . . , Ak ), Ā = ĀK = (A1, . . . , AK )
• History H1 = X1, Hk = (X1, A1, . . . , Xk−1, Ak−1, Xk ) = (X̄k , Āk−1),
k = 2, . . . , K
• Y = observed outcome (after Decision K or function of HK )
242
28. Data
Data sources:
• Longitudinal observational study: Retrospective from an existing
database, completed conventional clinical trial with followup,
prospective cohort study
• Randomized study: Prospective clinical trial conducted
specifically for this purpose (SMART)
Longitudinal observational study: Challenges
• Time-dependent confounding: A subject’s history at each
decision point is both determined by past treatments and used to
select future treatments
• Characteristics associated with both treatment selection and
future characteristics/ultimate outcome may not be captured in
the data
243
30. Data
Do we really need data like these? Can't we just "piece together"
an optimal regime using single decision methods on data from
separate studies (with different subjects in each)?
• E.g., acute leukemia: Estimate d1opt from a study comparing
{C1, C2} and d2opt from separate studies comparing {M1, M2}
and {S1, S2}
• Delayed effects: The induction therapy with the highest
proportion of responders might have other effects that render
subsequent treatments less effective in regard to survival
• Require data from a study involving the same subjects over the
entire sequence of decisions
245
31. Statistical problem
Ultimate goal: Based on the data (5.9), estimate dopt ∈ D satisfying
(5.8), i.e.,
E{Y*(dopt)} ≥ E{Y*(d)} for all d ∈ D
Challenge: dopt is defined in terms of the potential outcomes (5.6)
• Must be able to express this definition in terms of the observed
data (5.9)
• In particular, for any d ∈ D, must be able to identify the
distribution of
{X1, X2*(d1), X3*(d̄2), . . . , XK*(d̄K−1), Y*(d)}
which depends on that of (X1, W*), from the distribution of
(X1, A1, X2, A2, . . . , XK , AK , Y)
• Possible under the following assumptions generalizing those in
(3.4), (3.5), and (3.6)
246
32. Identifiability assumptions
SUTVA (consistency):
Xk = Xk*(Āk−1) = Σ_{āk−1∈Āk−1} Xk*(āk−1) I(Āk−1 = āk−1), k = 2, . . . , K (5.10)
Y = Y*(Ā) = Σ_{ā∈Ā} Y*(ā) I(Ā = ā)
• Observed intervening information and the final observed
outcome are those that would potentially be seen under the
treatments actually received at each decision point
• Are the same (consistent) regardless of how the treatments are
administered at each decision point
247
33. Identifiability assumptions
Sequential randomization assumption (SRA): Robins (1986)
W* ⊥⊥ Ak | X̄k , Āk−1, k = 1, . . . , K, where A0 is null (5.11)
equivalently
W* ⊥⊥ Ak | Hk , k = 1, . . . , K
• (5.11) is the strongest version; weaker versions require only that
treatment selection be independent of future potential outcomes
• Unverifiable from the observed data
• Unlikely to hold for data from a longitudinal observational study
not carried out with estimation of treatment regimes in mind
• But (5.11) holds by design in a SMART
248
34. Identifiability assumptions
Positivity assumption: With feasible sets of treatment options at
each decision point, more complicated
• Intuitively: To identify the distribution of the potential outcomes
(5.6) from that of the observed data (5.9), all treatment options in
the feasible sets Ψk (hk ) = Ψk (x̄k , āk−1) must be represented in
the observed data, k = 1, . . . , K
• That is, there must be individuals in the data who received each
of the options in Ψk (hk ), k = 1, . . . , K
• E.g., acute leukemia: at Decision 2, there must be responders who
received each of M1 and M2 and nonresponders who received
each of S1 and S2
Built up recursively. . .
249
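The intuition above lends itself to a simple empirical check: for each history pattern at Decision 2, verify that every feasible option was actually received by someone. A sketch with invented toy data and field names:

```python
# A sketch of an empirical check of positivity at Decision 2: for each
# responder status, verify every feasible option appears in the data.
# The toy data and the encoding (r2, a2) are invented for illustration.
from collections import defaultdict

# (responder status r2, option a2 received) for a toy sample
data = [(1, "M1"), (1, "M2"), (1, "M1"), (0, "S1"), (0, "S1"), (0, "S2")]
feasible = {1: {"M1", "M2"}, 0: {"S1", "S2"}}   # Psi_2 by responder status

seen = defaultdict(set)
for r2, a2 in data:
    seen[r2].add(a2)

violations = {r2: feasible[r2] - seen[r2] for r2 in feasible}
print(violations)   # empty sets: all feasible options represented
```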
35. Identifiability assumptions
Decision 1: Set of all possible baseline info h1 = x1
Γ1 = {x1 ∈ X1 satisfying P(X1 = x1) > 0} ⊆ X1 = H1
Set of all possible histories and associated options in Ψ1(h1)
Λ1 = {(x1, a1) such that x1 = h1 ∈ Γ1, a1 ∈ Ψ1(h1)}
All options in Λ1 must be represented in the data
P(A1 = a1|H1 = h1) > 0 for all (h1, a1) ∈ Λ1 (5.12)
First component of the positivity assumption
250
36. Identifiability assumptions
Decision 2: All possible histories h2 consistent with following a
Ψ-specific regime at Decision 1
Γ2 = {(x̄2, a1) ∈ X̄2 × A1 satisfying (x1, a1) ∈ Λ1 and
P{X2*(a1) = x2 | X1 = x1} > 0} ⊆ H2
Under SUTVA (5.10) and SRA (5.11), equivalently
Γ2 = {(x̄2, a1) ∈ X̄2 × A1 satisfying (x1, a1) ∈ Λ1 and
P(X2 = x2 | X1 = x1, A1 = a1) > 0}
Set of all possible histories h2 and associated options in Ψ2(h2)
Λ2 = {(x̄2, ā2) such that (x̄2, a1) = h2 ∈ Γ2, a2 ∈ Ψ2(h2)}
251
37. Identifiability assumptions
Decision 2, continued: All options in Λ2 must be represented in the
data
P(A2 = a2 | H2 = h2) > 0 for all (h2, a2) ∈ Λ2
Second component of the positivity assumption
...
Decision k: All histories hk consistent with a Ψ-specific regime
through Decision k − 1
Γk = {(x̄k , āk−1) ∈ X̄k × Āk−1 satisfying (x̄k−1, āk−1) ∈ Λk−1 and
P{Xk*(āk−1) = xk | X̄k−1*(āk−2) = x̄k−1} > 0} ⊆ Hk (5.13)
= {(x̄k , āk−1) ∈ X̄k × Āk−1 satisfying (x̄k−1, āk−1) ∈ Λk−1 and
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0} (5.14)
252
38. Identifiability assumptions
Decision k, continued: Set of all possible histories hk and
associated options in Ψk (hk )
Λk = {(x̄k , āk ) such that (x̄k , āk−1) = hk ∈ Γk , ak ∈ Ψk (hk )}
All options in Ψk (hk ) must be represented in the data
P(Ak = ak | Hk = hk ) > 0 for all (hk , ak ) ∈ Λk
kth component of the positivity assumption
...
253
39. Identifiability assumptions
Positivity assumption: Summarizing
P(Ak = ak | Hk = hk ) = P(Ak = ak | X̄k = x̄k , Āk−1 = āk−1) > 0
for hk = (x̄k , āk−1) ∈ Γk and ak ∈ Ψk (hk ) = Ψk (x̄k , āk−1),
k = 1, . . . , K (5.15)
• The positivity assumption (5.15) holds in a SMART by design if
there are subjects with history hk randomized to all options in
Ψk (hk ) at each Decision k = 1, . . . , K
• No guarantee that for a given Ψ (5.15) holds for data from a
longitudinal observational study (more shortly)
254
40. Identifiability assumptions
Equivalence of (5.13) and (5.14): Assuming SUTVA, SRA, need to
show for any hk = (x̄k , āk−1) ∈ Γk in (5.13)
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1)
= P{Xk*(āk−1) = xk | X̄k−1*(āk−2) = x̄k−1} (5.16)
• Proof is by induction (k = 1, 2 are immediate)
• Repeated use of the following lemma
Lemma. Let A and H be random variables, assume
W* ⊥⊥ A | H, and consider two functions f1(W*) and f2(W*) of
W*. If the event {f2(W*) = f2, H = h, A = a} has positive
probability, then
P{f1(W*) = f1 | f2(W*) = f2, H = h, A = a}
= P{f1(W*) = f1 | f2(W*) = f2, H = h}
255
41. Identifiability assumptions
Sketch of induction proof:
• We need to show for any hk = (x̄k , āk−1) ∈ Γk in (5.13), k = 1, . . . , K
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1)
= P{Xk*(āk−1) = xk | X̄k−1*(āk−2) = x̄k−1} (5.16)
• (5.16) is trivial for k = 1
• k = 2: (5.12) implies P(X1 = x1, A1 = a1) > 0 for (x1, a1) ∈ Λ1, so
P(X2 = x2 | X1 = x1, A1 = a1) is well defined. Then by SUTVA and
SRA, (5.16) holds:
P(X2 = x2 | X1 = x1, A1 = a1) = P{X2*(a1) = x2 | X1 = x1, A1 = a1}
= P{X2*(a1) = x2 | X1 = x1}
• For general k: Need to show P(X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0, so
that P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1) is well defined, and then
show (5.16)
256
42. Identifiability assumptions
• Assume this holds for k − 1 and then show it holds for k. Thus, for
hk−1 = (x̄k−1, āk−2) ∈ Γk−1, P(X̄k−2 = x̄k−2, Āk−2 = āk−2) > 0 and
P(Xk−1 = xk−1 | X̄k−2 = x̄k−2, Āk−2 = āk−2)
= P{Xk−1*(āk−2) = xk−1 | X̄k−2*(āk−3) = x̄k−2} > 0
• It follows that P(X̄k−1 = x̄k−1, Āk−2 = āk−2) > 0
• Because hk−1 = (x̄k−1, āk−2) ∈ Γk−1 and ak−1 ∈ Ψk−1(x̄k−1, āk−2),
P(Ak−1 = ak−1 | X̄k−1 = x̄k−1, Āk−2 = āk−2) > 0
• It then follows that P(X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0 as required
• Now show (5.16) using SUTVA, SRA, and the Lemma
257
43. Identifiability assumptions
• By repeated use of SUTVA and SRA
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1)
= P{Xk*(āk−1) = xk | X̄k−2 = x̄k−2, Xk−1*(āk−2) = xk−1,
Āk−3 = āk−3, Ak−2 = ak−2} (5.17)
• By the Lemma with f1(W*) = Xk*(āk−1), A = Ak−2, H = (X̄k−2, Āk−3),
and f2(W*) = Xk−1*(āk−2), (5.17) =
P{Xk*(āk−1) = xk | X̄k−2 = x̄k−2, Xk−1*(āk−2) = xk−1, Āk−3 = āk−3}
= P{Xk*(āk−1) = xk | X̄k−3 = x̄k−3, Xk−2*(āk−3) = xk−2,
Xk−1*(āk−2) = xk−1, Āk−4 = āk−4, Ak−3 = ak−3} (5.18)
• By the Lemma with f1(W*) = Xk*(āk−1), A = Ak−3, H = (X̄k−3, Āk−4),
(5.18) =
P{Xk*(āk−1) = xk | X̄k−3 = x̄k−3, Xk−2*(āk−3) = xk−2,
Xk−1*(āk−2) = xk−1, Āk−4 = āk−4}
258
44. Identifiability assumptions
• Continuing to apply the Lemma and SUTVA leads to (5.16),
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1)
= P{Xk*(āk−1) = xk | X̄k−1*(āk−2) = x̄k−1}
• Because hk = (x̄k , āk−1) ∈ Γk , the RHS is > 0
• Thus, we have shown that (5.16) holds for k and
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0
for hk = (x̄k , āk−1) ∈ Γk
• Because this holds for k = 1, 2, the result follows by induction
259
45. Identifiability assumptions
More precise definition of a regime: A Ψ-specific regime
d = (d1, . . . , dK ) satisfies
• Each rule dk , k = 1, . . . , K, is a mapping from Γk ⊆ Hk into Ak
for which dk (hk ) ∈ Ψk (hk ) for every hk ∈ Γk
• The class D of Ψ-specific regimes is the set of all such d
260
46. Identifiability assumptions
Observational data: For given Ψ, no guarantee that all options in
Ψk (hk ), k = 1, . . . , K, are represented in the data
• Define for k = 1, . . . , K
Γk^max = {(x̄k , āk−1) ∈ X̄k × Āk−1 satisfying (x̄k−1, āk−1) ∈ Λk−1^max
and P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0}
Ψk^max(hk ) = {ak ∈ Ak satisfying P(Ak = ak | Hk = hk ) > 0},
hk = (x̄k , āk−1) ∈ Γk^max
Λk^max = {(x̄k , āk ) such that (x̄k , āk−1) = hk ∈ Γk^max, ak ∈ Ψk^max(hk )}
• The class of regimes based on Ψ^max = (Ψ1^max, . . . , ΨK^max) is the
largest that can be considered
• So must have Ψk (hk ) ⊆ Ψk^max(hk ), k = 1, . . . , K, for all
hk ∈ Γk ⊆ Γk^max
261
47. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
262
48. Identifiability result
Goal, again: For any Ψ-specific regime d ∈ D, demonstrate that we
can identify the distribution of
{X1, X2*(d1), X3*(d̄2), . . . , XK*(d̄K−1), Y*(d)}
which depends on that of (X1, W*), from the distribution of
(X1, A1, X2, A2, . . . , XK , AK , Y)
Recall: Recursive representation d̄k (x̄k ) of the treatment options
selected by d through Decision k in (5.2) if an individual follows d,
k = 2, . . . , K
d̄k (x̄k ) = [d1(x1), d2{x̄2, d1(x1)}, . . . , dk {x̄k , d̄k−1(x̄k−1)}]
263
49. g-Computation algorithm
Main result: Under SUTVA (5.10), SRA (5.11), and the positivity
assumption (5.15), the joint density of the potential outcomes
{X1, X2*(d1), X3*(d̄2), . . . , XK*(d̄K−1), Y*(d)} can be obtained as
p_{X1,X2*(d1),...,XK*(d̄K−1),Y*(d)}(x1, . . . , xK , y) (5.19)
= p_{Y|X̄,Ā}{y | x̄, d̄(x̄)}
× p_{XK|X̄K−1,ĀK−1}{xK | x̄K−1, d̄K−1(x̄K−1)}
... (5.20)
× p_{X2|X1,A1}{x2 | x1, d1(x1)}
× p_{X1}(x1)
for any realization (x1, x2, . . . , xK , y) for which (5.19) is positive
• Due to Robins (1986, 1987, 2004)
264
50. g-Computation algorithm
Additional definition: Relevant realizations (x1, x2, . . . , xK , y) are
determined by feasible sets
• Assume Y*(ā) and Y take values y ∈ Y
• Define
ΓK+1 = {(x̄, ā, y) ∈ X̄ × Ā × Y satisfying (x̄, ā) ∈ ΛK and
P{Y*(ā) = y | X̄K*(āK−1) = x̄K } > 0}
= {(x̄, ā, y) ∈ X̄ × Ā × Y satisfying (x̄, ā) ∈ ΛK and
P(Y = y | X̄ = x̄, Ā = ā) > 0}
• This equality can be shown similarly to that of (5.13) and (5.14)
265
51. g-Computation algorithm
Simplification: Take all random variables discrete, so that (5.19) and
(5.20) become
P{X1 = x1, X2*(d1) = x2, . . . , XK*(d̄K−1) = xK , Y*(d) = y} (5.21)
= P{Y = y | X̄K = x̄K , ĀK = d̄K (x̄K )}
× P{XK = xK | X̄K−1 = x̄K−1, ĀK−1 = d̄K−1(x̄K−1)}
... (5.22)
× P{X2 = x2 | X1 = x1, A1 = d1(x1)}
× P(X1 = x1)
for any realization (x1, x2, . . . , xK , y) such that
P{X1 = x1, X2*(d1) = x2, . . . , XK*(d̄K−1) = xK , Y*(d) = y} > 0
• Need to show (5.21) = (5.22)
266
52. g-Computation algorithm
Demonstration: Factorize (5.21) as
P{X1 = x1, X2*(d1) = x2, . . . , XK*(d̄K−1) = xK , Y*(d) = y}
= P{Y*(d) = y | X̄K*(d̄K−1) = x̄K }
× P{XK*(d̄K−1) = xK | X̄K−1*(d̄K−2) = x̄K−1}
...
× P{X2*(d1) = x2 | X1 = x1}
× P(X1 = x1)
• All components on the RHS are positive because (5.21) > 0
267
53. g-Computation algorithm
• From (5.22), it suffices to show
P{Y*(d) = y | X̄K*(d̄K−1) = x̄K }
= P{Y = y | X̄K = x̄K , ĀK = d̄K (x̄K )}
where P{Y = y | X̄K = x̄K , ĀK = d̄K (x̄K )} > 0
and for k = 2, . . . , K
P{Xk*(d̄k−1) = xk | X̄k−1*(d̄k−2) = x̄k−1}
= P{Xk = xk | X̄k−1 = x̄k−1, Āk−1 = d̄k−1(x̄k−1)}
where P{Xk = xk | X̄k−1 = x̄k−1, Āk−1 = d̄k−1(x̄k−1)} > 0
• These follow immediately if, for XK+1*(d̄K ) = Y*(d),
{x̄k , d̄k−1(x̄k−1)} ∈ Γk , k = 2, . . . , K + 1 (5.23)
which can be shown by induction (first show for k = 2)
268
54. g-Computation algorithm
Sketch of induction proof: First take k = 2
• Because P(X1 = x1) > 0, x1 ∈ Γ1 and d is a Ψ-specific regime,
d1(x1) ∈ Ψ1(h1), so that {x1, d1(x1)} ∈ Λ1
• Because (5.21) > 0,
P{X2*(d1) = x2 | X1 = x1} > 0 =⇒ {x̄2, d1(x1)} ∈ Γ2
which is (5.23) for k = 2
• Now assume {x̄k−1, d̄k−2(x̄k−2)} ∈ Γk−1 is true
• Because d is a Ψ-specific regime, dk−1(hk−1) ∈ Ψk−1(hk−1) and thus
{x̄k−1, d̄k−1(x̄k−1)} ∈ Λk−1
• Because (5.21) > 0,
P{Xk*(d̄k−1) = xk | X̄k−1*(d̄k−2) = x̄k−1} > 0 =⇒ {x̄k , d̄k−1(x̄k−1)} ∈ Γk
completing the induction proof
269
55. g-Computation algorithm
General result: Compactly stated
p_{X1,X2*(d1),...,XK*(d̄K−1),Y*(d)}(x1, . . . , xK , y) (5.24)
= p_{Y|X̄,Ā}{y | x̄, d̄K (x̄)} Π_{k=2}^{K} p_{Xk|X̄k−1,Āk−1}{xk | x̄k−1, d̄k−1(x̄k−1)} p_{X1}(x1)
which implies, for example,
p_{Y*(d)}(y) = ∫_X̄ p_{Y|X̄,Ā}{y | x̄, d̄(x̄)} (5.25)
× Π_{k=2}^{K} p_{Xk|X̄k−1,Āk−1}{xk | x̄k−1, d̄k−1(x̄k−1)} p_{X1}(x1) dνK (xK ) · · · dν1(x1)
• dνK (xK ) · · · dν1(x1) is the dominating measure
270
56. g-Computation algorithm
Or the value
E{Y*(d)} = ∫_X̄ E{Y | X̄ = x̄, Ā = d̄(x̄)} (5.26)
× Π_{k=2}^{K} p_{Xk|X̄k−1,Āk−1}{xk | x̄k−1, d̄k−1(x̄k−1)} p_{X1}(x1) dνK (xK ) · · · dν1(x1)
• Thus, the value V(d) = E{Y*(d)} of a regime d ∈ D can be expressed
in terms of the observed data
• So it should be possible to estimate V(d) from these data
• As well as to estimate V(dopt) (later). . .
271
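When everything is discrete, the integral in (5.26) is a finite sum. A sketch for K = 2 with binary x1, x2, a1, a2, using an invented generative model (all probabilities and the outcome model are assumptions for illustration, not from the slides):

```python
# A sketch of (5.26) for K = 2 binary stages: the value of d computed by
# exact summation over the sample space. The model below is invented.
from itertools import product

p_x1 = {0: 0.5, 1: 0.5}                  # P(X1 = x1)

def p_x2(x2, x1, a1):
    """P(X2 = x2 | X1 = x1, A1 = a1); treatment raises P(X2 = 1)."""
    p1 = 0.3 + 0.4 * a1
    return p1 if x2 == 1 else 1.0 - p1

def mean_y(x1, x2, a1, a2):
    """E(Y | x1, x2, a1, a2): outcome model, values invented."""
    return 1.0 + x1 + 2.0 * x2 + a1 + 3.0 * a2 * x2

d1 = lambda x1: 1                   # always treat at Decision 1
d2 = lambda x1, x2, a1: x2          # treat at Decision 2 iff x2 = 1

value = sum(p_x1[x1] * p_x2(x2, x1, d1(x1))
            * mean_y(x1, x2, d1(x1), d2(x1, x2, d1(x1)))
            for x1, x2 in product([0, 1], repeat=2))
print(round(value, 3))   # 6.0 for this toy model
```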
57. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
272
58. Fixed regime d ∈ D
Of interest: Estimation of the value V(d) = E{Y*(d)} of a given, or
fixed, Ψ-specific regime d ∈ D
• In its own right
• As a stepping stone to estimation of an optimal regime dopt
• We consider several methods
• Throughout, take SUTVA (5.10), SRA (5.11), and positivity
assumption (5.15) to hold
273
59. Estimation via g-computation
In principle: From (5.25) and (5.26), can estimate p_{Y*(d)}(y) or
E{Y*(d)} for fixed d ∈ D
• Posit parametric models, e.g., for (5.25)
p_{Y|X̄,Ā}(y | x̄, ā; ζK+1)
p_{Xk|X̄k−1,Āk−1}(xk | x̄k−1, āk−1; ζk ), k = 2, . . . , K
p_{X1}(x1; ζ1)
depending on ζ = (ζ1ᵀ, . . . , ζK+1ᵀ)ᵀ (or for (5.26) a model for
E(Y | X̄ = x̄, Ā = ā) instead)
• Estimate ζ by maximizing the partial likelihood
Π_{i=1}^{n} p_{Y|X̄,Ā}(Yi | X̄i , Āi ; ζK+1) Π_{k=2}^{K} p_{Xk|X̄k−1,Āk−1}(Xki | X̄k−1,i , Āk−1,i ; ζk ) p_{X1}(X1i ; ζ1)
in ζ to obtain ζ̂ = (ζ̂1ᵀ, . . . , ζ̂K+1ᵀ)ᵀ
274
60. Estimation via g-computation
In principle:
• Substitute the fitted models in (5.25) or (5.26)
Major obstacle: (5.25) and (5.26) involve integration over the sample
space X̄ = X1 × · · · × XK
• Except in the simplest situations, e.g., all x1, . . . , xK discrete and
low-dimensional, the required integration is almost certainly
analytically intractable and computationally insurmountable
275
61. Estimation via g-computation
Monte Carlo integration: Robins (1986) proposed approximating
the distribution of Y*(d) for d ∈ D; for r = 1, . . . , M, simulate a
realization from the distribution of Y*(d) as follows
1. Generate random x1r from p_{X1}(x1; ζ̂1)
2. Generate random x2r from p_{X2|X1,A1}{x2 | x1r , d1(x1r ); ζ̂2}
3. Continue in this fashion, generating random xkr from
p_{Xk|X̄k−1,Āk−1}{xk | x̄k−1,r , d̄k−1(x̄k−1,r ); ζ̂k }, k = 3, . . . , K
4. Generate random yr from p_{Y|X̄,Ā}{y | x̄r , d̄K (x̄r ); ζ̂K+1}
y1, . . . , yM are a sample from the fitted distribution of Y*(d)
Estimator for V(d) = E{Y*(d)}: V̂GC(d) = M⁻¹ Σ_{r=1}^{M} yr
276
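Steps 1-4 can be sketched as follows, with invented distributions standing in for the fitted models (the model and its parameter values are assumptions for illustration only):

```python
# A sketch of Monte Carlo g-computation for K = 2 binary stages: follow
# steps 1-4 with "fitted" distributions that are simply invented here.
import random

random.seed(1)
M = 100_000

def draw_x1():               return random.random() < 0.5            # step 1
def draw_x2(x1, a1):         return random.random() < 0.3 + 0.4 * a1 # step 2
def draw_y(x1, x2, a1, a2):  # step 4: outcome given the full history
    return 1.0 + x1 + 2.0 * x2 + a1 + 3.0 * a2 * x2 + random.gauss(0, 1)

d1 = lambda x1: 1                   # always treat at Decision 1
d2 = lambda x1, x2, a1: int(x2)     # treat at Decision 2 iff x2 = 1

total = 0.0
for _ in range(M):
    x1 = draw_x1()
    a1 = d1(x1)                     # option selected by d at Decision 1
    x2 = draw_x2(x1, a1)            # step 3 is vacuous when K = 2
    a2 = d2(x1, x2, a1)             # option selected by d at Decision 2
    total += draw_y(x1, x2, a1, a2)

v_gc = total / M                    # Monte Carlo estimate of V(d)
print(round(v_gc, 2))               # close to 6.0, the exact value here
```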
62. Estimation via g-computation
Practical challenges:
• Development of models can be daunting due to high
dimension/complexity of x1, . . . , xK
• Although specifying p_{Y|X̄,Ā}(y | x̄, ā; ζK+1) may be feasible for
univariate Y, models
p_{Xk|X̄k−1,Āk−1}(xk | x̄k−1, āk−1; ζk ), k = 2, . . . , K, p_{X1}(x1; ζ1)
for multivariate Xk are more challenging to specify
• E.g., Xk = (Xk1ᵀ, Xk2ᵀ)ᵀ continuous/discrete, can factor as
p_{Xk1|Xk2,X̄k−1,Āk−1}(xk1 | xk2, x̄k−1, āk−1; ζk1)
× p_{Xk2|X̄k−1,Āk−1}(xk2 | x̄k−1, āk−1; ζk2)
or p_{Xk2|Xk1,X̄k−1,Āk−1}(xk2 | xk1, x̄k−1, āk−1; ζk2)
× p_{Xk1|X̄k−1,Āk−1}(xk1 | x̄k−1, āk−1; ζk1)
277
63. Estimation via g-computation
Practical challenges, continued:
• Moreover, simulation from such models can be demanding
• Analytical derivation of approximate standard errors is not
straightforward (frankly daunting!); use of a nonparametric
bootstrap has been advocated, which is clearly highly
computationally intensive
Bottom line: Estimation of V(d) via the g-computation algorithm is
not commonplace in practice
• The main usefulness of g-computation is as a demonstration that
it is possible to identify and estimate V(d) from observed data
under SUTVA, SRA, and positivity
• In principle, a possible approach to estimating dopt ∈ D is to
maximize V̂GC(d) over all d ∈ D; clearly, this would be a
formidable computational challenge (and is never done in
practice)
278
64. Inverse probability weighted estimator
Motivation: Alternative representation of E{Y*(d)} in terms of the
observed data
(X1, A1, X2, A2, . . . , XK , AK , Y)
depending on
p_{A1|H1}(a1 | h1) = P(A1 = a1 | H1 = h1) = p_{A1|X1}(a1 | x1)
p_{Ak|Hk}(ak | hk ) = P(Ak = ak | Hk = hk ) = p_{Ak|X̄k,Āk−1}(ak | x̄k , āk−1)
k = 2, . . . , K
Define: Evaluating at d1(X1), dk {X̄k , d̄k−1(X̄k−1)}
πd,1(X1) = p_{A1|X1}{d1(X1) | X1}
πd,k (X̄k ) = p_{Ak|X̄k,Āk−1}[dk {X̄k , d̄k−1(X̄k−1)} | X̄k , d̄k−1(X̄k−1)]
k = 2, . . . , K
279
65. Inverse probability weighted estimator
Define: Indicator of consistency of options actually received with
those selected by d at all K decisions
Cd = Cd̄K = I{A1 = d1(X1), . . . , ĀK = d̄K (X̄K )} = I{Ā = d̄(X̄)}
Inverse probability weighted estimator for V(d) = E{Y*(d)}:
V̂IPW (d) = n⁻¹ Σ_{i=1}^{n} Cd,i Yi / { Π_{k=2}^{K} πd,k (X̄ki) πd,1(X1i) } (5.27)
• (5.27) is an unbiased estimator for V(d) because
E[ Cd Y / { Π_{k=2}^{K} πd,k (X̄k ) πd,1(X1) } ] = E{Y*(d)} (5.28)
under SUTVA (5.10), SRA (5.11), and positivity assumption (5.15)
280
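As a numerical sanity check, here is a minimal sketch of (5.27) for K = 2 with binary options randomized with known probability 1/2 at each decision, as in a SMART. The data-generating scheme is simulated and purely illustrative, not from the source.

```python
import numpy as np

# Hypothetical sketch of the IPW estimator (5.27) for K = 2 with binary
# options randomized with known probability 1/2 at each decision, as in
# a SMART; the data-generating model is simulated for illustration only.
rng = np.random.default_rng(0)
n = 5000

x1 = rng.normal(size=n)                  # baseline information X1
a1 = rng.binomial(1, 0.5, size=n)        # A1 randomized, P(A1 = 1) = 1/2
x2 = x1 + a1 + rng.normal(size=n)        # intermediate information X2
a2 = rng.binomial(1, 0.5, size=n)        # A2 randomized, P(A2 = 1) = 1/2
y = x2 + a2 + rng.normal(size=n)         # final outcome Y

# Regime d: select option 1 at both decisions
c_d = (a1 == 1) & (a2 == 1)              # consistency indicator C_d
pi_d1 = 0.5                              # pi_{d,1}(X1), known
pi_d2 = 0.5                              # pi_{d,2}(Xbar2), known

v_ipw = np.mean(c_d * y / (pi_d2 * pi_d1))   # estimator (5.27)
```

Only about a quarter of subjects have C_d = 1; their outcomes are weighted up by 1/(π_{d,1} π_{d,2}) = 4, which foreshadows the instability of IPW for larger K.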
66. Inverse probability weighted estimator
Sketch of proof of (5.28): We show the more general result
$$E\left\{\frac{C_d\, f(\bar{X}, Y)}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)}\right\} = E[f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}] \qquad (5.29)$$
so that (5.28) follows by taking f(x̄, y) = y
• Using SUTVA and (5.7)
$$E\left\{\frac{C_d\, f(\bar{X}, Y)}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)}\right\} = E\left[\frac{C_d\, f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}}{\prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}\, \pi_{d,1}(X_1)}\right]$$
$$= E\left(E\left[\frac{I\{\bar{A} = \bar{d}(\bar{X})\}\, f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}}{\prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}\, \pi_{d,1}(X_1)}\,\Bigg|\, X_1, W^*\right]\right)$$
$$= E\left[\frac{P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1, W^*\}\, f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}}{\prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}\, \pi_{d,1}(X_1)}\right] \qquad (5.30)$$
281
67. Inverse probability weighted estimator
From (5.30): Must show
$$\prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}\, \pi_{d,1}(X_1) = P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1, W^*\} > 0 \qquad (5.31)$$
• For (x1, w) such that $P(X_1 = x_1, W^* = w) > 0$, show
$$P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1 = x_1, W^* = w\} > 0$$
• There exist x2, . . . , xK such that $P\{\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\} > 0$, k = 2, . . . , K,
because $\bar{X}^*_K(\bar{d}_{K-1})$ is a function of W*
• Using SUTVA
$$P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1 = x_1, W^* = w\} = P\{A_1 = d_1(x_1)\,|\,X_1 = x_1, W^* = w\}$$
$$\times \prod_{k=2}^K P[\,A_k = d_k\{\bar{x}_k, \bar{d}_{k-1}(\bar{x}_{k-1})\}\,|\,\bar{A}_{k-1} = \bar{d}_{k-1}(\bar{x}_{k-1}), X_1 = x_1, W^* = w\,]$$
282
68. Inverse probability weighted estimator
• Must show each term in this factorization is well defined; true if
$$P\{\bar{A}_k = \bar{d}_k(\bar{x}_k), X_1 = x_1, W^* = w\} > 0, \quad k = 1, \ldots, K$$
• By induction; when k = 1,
$$P\{A_1 = d_1(x_1), X_1 = x_1, W^* = w\} = P\{A_1 = d_1(x_1)\,|\,X_1 = x_1, W^* = w\}\, P(X_1 = x_1, W^* = w)$$
is positive if $P\{A_1 = d_1(x_1)\,|\,X_1 = x_1, W^* = w\} > 0$
• This holds because $P(X_1 = x_1) > 0$, so $x_1 \in \Gamma_1$ and $d_1(x_1) \in \Psi_1(x_1)$, and thus
by the positivity assumption
$$P\{A_1 = d_1(x_1)\,|\,X_1 = x_1, W^* = w\} = P\{A_1 = d_1(x_1)\,|\,X_1 = x_1\} > 0$$
• Now assume $P\{\bar{A}_k = \bar{d}_k(\bar{x}_k), X_1 = x_1, W^* = w\} > 0$ and show
$$P\{\bar{A}_{k+1} = \bar{d}_{k+1}(\bar{x}_{k+1}), X_1 = x_1, W^* = w\} > 0$$
283
69. Inverse probability weighted estimator
• $\bar{X}^*_k(\bar{d}_{k-1})$ includes $X_1$, so $(X_1 = x_1, W^* = w)$ and
$\{\bar{X}^*_{k+1}(\bar{d}_k) = \bar{x}_{k+1}, W^* = w\}$ are equivalent, and thus
$$P\{\bar{A}_{k+1} = \bar{d}_{k+1}(\bar{x}_{k+1}), X_1 = x_1, W^* = w\}$$
$$= P[\,A_{k+1} = d_{k+1}\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\}\,|\,\bar{A}_k = \bar{d}_k(\bar{x}_k), \bar{X}^*_{k+1}(\bar{d}_k) = \bar{x}_{k+1}, W^* = w\,]$$
$$\times P\{\bar{A}_k = \bar{d}_k(\bar{x}_k), X_1 = x_1, W^* = w\}$$
• Must show the first term on the right-hand side is > 0; by SUTVA and SRA, this term
$$= P[\,A_{k+1} = d_{k+1}\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\}\,|\,\bar{X}_{k+1} = \bar{x}_{k+1}, \bar{A}_k = \bar{d}_k(\bar{x}_k), W^* = w\,]$$
$$= P[\,A_{k+1} = d_{k+1}\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\}\,|\,\bar{X}_{k+1} = \bar{x}_{k+1}, \bar{A}_k = \bar{d}_k(\bar{x}_k)\,]$$
• Because $\{x_1, d_1(x_1)\} \in \Lambda_1$ and $P\{\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\} > 0$, the argument
leading to (5.23) yields $\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\} \in \Gamma_{k+1}$ and
$d_{k+1}\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\} \in \Psi_{k+1}(h_{k+1})$, so this term is > 0
284
70. Inverse probability weighted estimator
• Applying these results yields
$$P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1, W^*\} = p_{A_1|X_1}\{d_1(X_1)\,|\,X_1\} \times \prod_{k=2}^K p_{A_k|\bar{X}_k,\bar{A}_{k-1}}[\,d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}\,|\,\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\,]$$
$$= \pi_{d,1}(X_1) \prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}$$
which is the desired equality (5.31)
285
71. Inverse probability weighted estimator
Result: (5.28) holds:
$$E\left\{\frac{C_d\, Y}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)}\right\} = E\{Y^*(d)\}$$
• $\hat{V}_{IPW}(d)$ is an unbiased estimator for V(d)
• An alternative representation of E{Y*(d)} in terms of the observed data
• Taking instead, for fixed y,
$$f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\} = I\{Y^*(d) = y\}$$
yields an alternative representation of the marginal density of Y*(d)
286
72. Inverse probability weighted estimator
More generally: For fixed (x1, x2, . . . , xK, y), treating all variables as
discrete, taking
$$f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\} = I\{X_1 = x_1, X^*_2(d_1) = x_2, \ldots, X^*_K(\bar{d}_{K-1}) = x_K, Y^*(d) = y\}$$
yields an alternative representation of the joint density
$$P\{X_1 = x_1, X^*_2(d_1) = x_2, \ldots, X^*_K(\bar{d}_{K-1}) = x_K, Y^*(d) = y\}$$
$$= p_{X_1, X^*_2(d_1), X^*_3(\bar{d}_2), \ldots, X^*_K(\bar{d}_{K-1}), Y^*(d)}(x_1, \ldots, x_K, y)$$
$$= E\left\{\frac{C_d\, I(X_1 = x_1, X_2 = x_2, \ldots, X_K = x_K, Y = y)}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)}\right\}$$
287
73. Inverse probability weighted estimator
Denominator:
$$\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)$$
• Can be interpreted as the propensity for receiving treatment
consistent with regime d through all K decisions, given the observed
history
• Depends on the propensities of treatment given observed history
$$p_{A_k|H_k}(a_k|h_k) = P(A_k = a_k\,|\,H_k = h_k), \quad k = 1, \ldots, K$$
• In practice: Posit and fit models for the propensities depending
on parameters γk, yielding estimators $\hat{\gamma}_k$, k = 1, . . . , K;
considerations for this momentarily
• These models induce models $\pi_{d,1}(X_1; \gamma_1)$ and $\pi_{d,k}(\bar{X}_k; \gamma_k)$
288
74. Inverse probability weighted estimator
In practice: IPW estimator
$$\hat{V}_{IPW}(d) = n^{-1}\sum_{i=1}^n \frac{C_{d,i}\, Y_i}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)} \qquad (5.32)$$
• A consistent estimator for V(d) as long as the propensity models
are correctly specified
289
75. Inverse probability weighted estimator
Alternative estimator:
$$\hat{V}_{IPW*}(d) = \left\{\sum_{i=1}^n \frac{C_{d,i}}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)}\right\}^{-1} \times \sum_{i=1}^n \frac{C_{d,i}\, Y_i}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)} \qquad (5.33)$$
• A weighted average, also a consistent estimator
• Can be considerably more precise than $\hat{V}_{IPW}(d)$
290
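A minimal sketch contrasting the two estimators may help; here the true propensities depend on history, as in observational data, and everything (models, coefficients, names) is an invented illustration. The normalized version simply divides by the sum of the weights rather than by n.

```python
import numpy as np

# Hypothetical sketch contrasting the simple IPW estimator (5.32) with
# the weighted-average version (5.33), with history-dependent true
# propensities (an observational-data setting); everything is simulated.
rng = np.random.default_rng(1)
n = 20000

x1 = rng.normal(size=n)
p1 = 1 / (1 + np.exp(-x1))               # true P(A1 = 1 | X1)
a1 = rng.binomial(1, p1)
x2 = x1 + a1 + rng.normal(size=n)
p2 = 1 / (1 + np.exp(-0.5 * x2))         # true P(A2 = 1 | Xbar2, A1)
a2 = rng.binomial(1, p2)
y = x2 + a2 + rng.normal(size=n)

# Regime d: option 1 at both decisions; weight = C_d / {pi_{d,2} pi_{d,1}}
c_d = (a1 == 1) & (a2 == 1)
w = c_d / (p1 * p2)

v_ipw = np.mean(w * y)                   # simple IPW, as in (5.32)
v_ipw_star = np.sum(w * y) / np.sum(w)   # normalized weights, as in (5.33)
```

Because the weights in (5.33) are normalized to sum to one, subjects with very small propensities of following d cannot dominate the estimate to the same degree, which is one intuition for its better finite-sample precision.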
76. Considerations for propensity modeling
Simplest case: Two options at each decision point, feasible for all
individuals, Ak = {0, 1}, k = 1, . . . , K
• Can work with propensity scores
$$\pi_k(h_k) = P(A_k = 1\,|\,H_k = h_k) = \pi_k(\bar{x}_k, \bar{a}_{k-1}), \quad k = 1, \ldots, K$$
$$p_{A_1|X_1}(a_1|x_1) = p_{A_1|H_1}(a_1|h_1) = \pi_1(h_1)^{a_1}\{1 - \pi_1(h_1)\}^{1-a_1}$$
$$p_{A_k|\bar{X}_k,\bar{A}_{k-1}}(a_k|\bar{x}_k, \bar{a}_{k-1}) = p_{A_k|H_k}(a_k|h_k) = \pi_k(h_k)^{a_k}\{1 - \pi_k(h_k)\}^{1-a_k}, \quad k = 2, \ldots, K$$
• Using (5.2)
$$\pi_{d,1}(X_1) = \pi_1(X_1)^{d_1(X_1)}\{1 - \pi_1(X_1)\}^{1-d_1(X_1)}$$
$$\pi_{d,k}(\bar{X}_k) = \pi_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}^{d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}} \times [1 - \pi_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}]^{1-d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}}, \quad k = 2, \ldots, K$$
291
77. Considerations for propensity modeling
• Can posit a parametric model $\pi_k(h_k; \gamma_k)$ for each k; e.g., logistic
regression models as in (3.12)
$$\pi_k(h_k; \gamma_k) = \frac{\exp(\gamma_{k1} + \gamma_{k2}^T h_k)}{1 + \exp(\gamma_{k1} + \gamma_{k2}^T h_k)}, \quad \gamma_k = (\gamma_{k1}, \gamma_{k2}^T)^T, \quad k = 1, \ldots, K$$
equivalently, $\exp(\gamma_k^T \tilde{h}_k)/\{1 + \exp(\gamma_k^T \tilde{h}_k)\}$ with $\tilde{h}_k = (1, h_k^T)^T$,
k = 1, . . . , K; fit via maximum likelihood to obtain $\hat{\gamma}_k$, k = 1, . . . , K
292
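The maximum likelihood fit for one such decision can be sketched in a few lines; the Newton-Raphson helper, data, and true coefficients below are all hypothetical illustrations, not part of the source material.

```python
import numpy as np

# Hypothetical sketch of fitting a logistic propensity model by maximum
# likelihood via Newton-Raphson, as one would do for each decision k; the
# data, true coefficients, and helper name are illustrative.
def fit_logistic(H, a, n_iter=25):
    """ML fit of P(A = 1 | H) = expit(H @ gamma); H includes the constant."""
    gamma = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-H @ gamma))
        W = p * (1 - p)
        # Newton step: gamma += (H^T W H)^{-1} H^T (a - p)
        gamma += np.linalg.solve((H * W[:, None]).T @ H, H.T @ (a - p))
    return gamma

rng = np.random.default_rng(2)
n = 20000
x1 = rng.normal(size=n)                                  # history h1 = x1
true_gamma = np.array([0.3, 0.8])                        # (gamma_11, gamma_12)
a1 = rng.binomial(1, 1 / (1 + np.exp(-(true_gamma[0] + true_gamma[1] * x1))))

H1 = np.column_stack([np.ones(n), x1])                   # design (1, h1^T)^T
gamma1_hat = fit_logistic(H1, a1)                        # estimate of gamma_1
```

In practice one would repeat this for each k = 1, . . . , K, with the design matrix built from the elements of the accrued history hk.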
78. Considerations for propensity modeling
More than 2 options: Feasible for all individuals, Ak = {1, . . . , mk}
• As on Slide 170, with
$$\omega_k(h_k, a_k) = P(A_k = a_k\,|\,H_k = h_k), \quad k = 1, \ldots, K$$
or $\omega_k(\bar{x}_k, \bar{a}_{k-1}, a_k) = P(A_k = a_k\,|\,\bar{X}_k = \bar{x}_k, \bar{A}_{k-1} = \bar{a}_{k-1})$, where
$$\omega_k(h_k, m_k) = 1 - \sum_{a_k=1}^{m_k-1} \omega_k(h_k, a_k)$$
• Then
$$\pi_{d,1}(X_1) = \prod_{a_1=1}^{m_1} \omega_1(X_1, a_1)^{I\{d_1(X_1) = a_1\}} = \sum_{a_1=1}^{m_1} I\{d_1(X_1) = a_1\}\, \omega_1(X_1, a_1)$$
$$\pi_{d,k}(\bar{X}_k) = \prod_{a_k=1}^{m_k} \omega_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1}), a_k\}^{I[d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\} = a_k]} = \sum_{a_k=1}^{m_k} I[d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\} = a_k]\, \omega_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1}), a_k\}, \quad k = 2, \ldots, K$$
293
79. Considerations for propensity modeling
• Can posit parametric models $\omega_k(h_k, a_k; \gamma_k)$ for each k; e.g.,
multinomial (polytomous) logistic regression models
$$\omega_k(h_k, a_k; \gamma_k) = \frac{\exp(\tilde{h}_k^T \gamma_{k,a_k})}{1 + \sum_{j=1}^{m_k-1} \exp(\tilde{h}_k^T \gamma_{kj})}, \quad a_k = 1, \ldots, m_k - 1$$
with $\tilde{h}_k = (1, h_k^T)^T$ and $\gamma_k = (\gamma_{k1}^T, \ldots, \gamma_{k,m_k-1}^T)^T$, and fit via maximum
likelihood to obtain $\hat{\gamma}_k$, k = 1, . . . , K
294
80. Considerations for propensity modeling
Feasible sets: $\ell_k$ distinct subsets at Decision k, each with ≥ 2 options in Ak
$$\mathcal{A}_{k,l} = \{1, \ldots, m_{kl}\}, \quad l = 1, \ldots, \ell_k, \quad k = 1, \ldots, K$$
• For k = 1, . . . , K and $a_k \in \{1, \ldots, m_{kl}\} = \mathcal{A}_{k,l}$,
$$\omega_{k,l}(h_k, a_k) = P(A_k = a_k\,|\,H_k = h_k), \qquad \omega_{k,l}(h_k, m_{kl}) = 1 - \sum_{a_k=1}^{m_{kl}-1} \omega_{k,l}(h_k, a_k)$$
• For k = 1, . . . , K, can posit $\ell_k$ separate logistic or multinomial
(polytomous) logistic regression models $\omega_{k,l}(h_k, a_k; \gamma_{kl})$, l = 1, . . . , $\ell_k$
295
81. Considerations for propensity modeling
• For each k = 1, . . . , K, this implies an overall model
$$\omega_k(h_k, a_k; \gamma_k) = \sum_{l=1}^{\ell_k} I\{s_k(h_k) = l\}\, \omega_{k,l}(h_k, a_k; \gamma_{kl}), \quad \gamma_k = (\gamma_{k1}^T, \ldots, \gamma_{k\ell_k}^T)^T$$
where it is understood that $a_k$ takes values in the relevant distinct subset
• For subsets with a single option, $P(A_k = a_k\,|\,H_k = h_k) = 1$, and
no model is needed
• Induced models
$$\pi_{d,1}(X_1; \gamma_1) = \sum_{l=1}^{\ell_1} I\{s_1(h_1) = l\} \prod_{a_1=1}^{m_{1l}} \omega_{1,l}(X_1, a_1; \gamma_{1l})^{I\{d_1(X_1) = a_1\}}$$
$$\pi_{d,k}(\bar{X}_k; \gamma_k) = \sum_{l=1}^{\ell_k} I\{s_k(h_k) = l\} \times \prod_{a_k=1}^{m_{kl}} \omega_{k,l}\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1}), a_k; \gamma_{kl}\}^{I[d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\} = a_k]}$$
296
82. Considerations for propensity modeling
In a SMART: Propensities are known
• Even so, as on Slide 90, it is preferable to estimate the propensities
• Estimate $P(A_k = a_k\,|\,H_k = h_k)$ by sample proportions corresponding to
each distinct subset $\mathcal{A}_{k,l}$
• Example: Acute leukemia, $\ell_2 = 2$, with $r_2$ the component of $h_2$ indicating
response
$$\Psi_2(h_2) = \{M_1, M_2\} = \mathcal{A}_{2,1} \text{ if } r_2 = 1, \qquad \Psi_2(h_2) = \{S_1, S_2\} = \mathcal{A}_{2,2} \text{ if } r_2 = 0$$
• Estimate $P(A_2 = a_2\,|\,H_2 = h_2)$ for $a_2 \in \mathcal{A}_{2,1}$ by
$$\left\{\sum_{i=1}^n I(R_{2i} = 1)\right\}^{-1} \sum_{i=1}^n I(R_{2i} = 1)\, I(A_{2i} = a_2)$$
• Estimate $P(A_2 = a_2\,|\,H_2 = h_2)$ for $a_2 \in \mathcal{A}_{2,2}$ by
$$\left\{\sum_{i=1}^n I(R_{2i} = 0)\right\}^{-1} \sum_{i=1}^n I(R_{2i} = 0)\, I(A_{2i} = a_2)$$
297
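These sample-proportion estimates are one-liners in code. The sketch below simulates a SMART-like second stage with responders randomized 1:1 between two maintenance options (coded 0/1); the data and coding are illustrative.

```python
import numpy as np

# Hypothetical sketch of sample-proportion propensity estimates in an
# acute-leukemia-style SMART: among responders (R2 = 1), estimate
# P(A2 = a2 | H2) by the observed proportion receiving each maintenance
# option (M1/M2 coded 0/1 here); simulated data, illustrative coding.
rng = np.random.default_rng(3)
n = 1000
r2 = rng.binomial(1, 0.6, size=n)        # responder status component of h2
a2 = rng.binomial(1, 0.5, size=n)        # second-stage option, randomized 1:1

resp = r2 == 1
p_m1 = np.sum(resp & (a2 == 0)) / np.sum(resp)   # estimate of P(A2 = M1 | responder)
p_m2 = np.sum(resp & (a2 == 1)) / np.sum(resp)   # estimate of P(A2 = M2 | responder)
```

The nonresponder propensities for {S1, S2} are computed the same way over the R2 = 0 subset; by construction the estimated probabilities within each feasible subset sum to one.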
83. Inverse probability weighted estimator
Equivalent representation: For both $\hat{V}_{IPW}(d)$ and $\hat{V}_{IPW*}(d)$
• When $C_d = 1$, $A_1 = d_1(X_1)$ and $A_k = d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}$,
k = 2, . . . , K, so it is straightforward that the denominator
$$\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)$$
can be replaced by the fitted model for
$$\prod_{k=1}^K p_{A_k|H_k}(A_{ki}|H_{ki})$$
without altering the value of either estimator
298
84. Inverse probability weighted estimator
• E.g., with 2 feasible options at each decision point, replace the
denominator by
$$\prod_{k=1}^K \pi_k(H_{ki}; \hat{\gamma}_k)^{A_{ki}}\{1 - \pi_k(H_{ki}; \hat{\gamma}_k)\}^{1-A_{ki}}$$
• With more than 2 options, by
$$\prod_{k=1}^K \omega_k(H_{ki}, A_{ki}; \hat{\gamma}_k)$$
• In some literature accounts, the estimators are defined with
these quantities in the denominators
• This will be important later
299
85. Inverse probability weighted estimator
Remarks:
• VIPW (d) and VIPW∗(d) are consistent estimators as long as the
propensity models are correctly specified; can be inconsistent
otherwise
• Except for estimation of the propensities, these estimators use data
only from subjects for whom $\bar{A} = \bar{d}(\bar{X})$
• For larger K, there are likely very few such subjects
• These estimators can be unstable in finite samples, because the
propensity of receiving treatment consistent with d in the denominator
can be small, and thus can exhibit large sampling variation
• As in the single decision case, even if the propensities are
known, it is preferable to estimate them via maximum likelihood
• Large sample approximations to the sampling distributions of
VIPW (d) and VIPW∗(d) follow from M-estimation theory
300
86. Augmented inverse probability weighted estimator
Drop out analogy: For individuals for whom $\bar{A} \neq \bar{d}(\bar{X})$
• Analogy to a monotone coarsening problem; i.e., "drop out"
• Under SUTVA, an individual with
$$A_1 = d_1(X_1), \ldots, A_{k-1} = d_{k-1}\{\bar{X}_{k-1}, \bar{d}_{k-2}(\bar{X}_{k-2})\}$$
but for whom
$$A_k \neq d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}$$
has $X_2 = X^*_2(d_1), \ldots, X_k = X^*_k(\bar{d}_{k-1})$; i.e., $\bar{X}_k = \bar{X}^*_k(\bar{d}_{k-1})$
• But his Xk+1, . . . , XK and Y do not reflect the potential outcomes
he would have if he had continued to follow rules dk , . . . , dK
• Effectively, in terms of receiving treatment options consistent
with d, the individual has “dropped out” at Decision k
301
87. Augmented inverse probability weighted estimator
Improvement: From the perspective of information on
$\{X_1, \bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}$, and especially on Y*(d)
• $\bar{X}^*_k(\bar{d}_{k-1})$ is observed, but $X^*_{k+1}(\bar{d}_k), \ldots, X^*_K(\bar{d}_{K-1})$ and Y*(d) are
"missing"
• Can we exploit the partial information from such individuals to
gain efficiency?
• Can we exploit the partial information from such individuals to
gain efficiency?
Dropout analogy: Under SUTVA, SRA, and positivity, this “drop out”
is according to a missing (coarsening) at random mechanism (shown
by Zhang et al., 2013), allowing semiparametric theory for monotone
coarsening at random to be used (Robins et al., 1994; Tsiatis, 2006)
302
88. Augmented inverse probability weighted estimator
Define:
• Indicator that treatment options are consistent with d through Decision k:
$$C_{\bar{d}_k} = I\{\bar{A}_k = \bar{d}_k(\bar{X}_k)\}, \quad k = 1, \ldots, K, \qquad C_{\bar{d}_0} \equiv 1$$
• For brevity, write $\bar{\pi}_{d,1}(X_1) = \pi_{d,1}(X_1)$ and
$$\bar{\pi}_{d,k}(\bar{X}_k) = \left\{\prod_{j=2}^k \pi_{d,j}(\bar{X}_j)\right\} \pi_{d,1}(X_1), \quad k = 2, \ldots, K, \qquad \bar{\pi}_{d,0} \equiv 1$$
• Substitution of the propensity models leads to models
$$\bar{\pi}_{d,k}(\bar{X}_k; \bar{\gamma}_k), \quad k = 1, \ldots, K, \qquad \bar{\gamma}_k = (\gamma_1^T, \ldots, \gamma_k^T)^T$$
303
89. Augmented inverse probability weighted estimator
Analogous to the AIPW estimator for a single decision: Under these
conditions, if the models for the propensities
$$p_{A_k|H_k}(a_k|h_k), \quad k = 1, \ldots, K$$
are correctly specified, from semiparametric theory, all consistent and
asymptotically normal estimators for V(d) for fixed d ∈ D are
asymptotically equivalent to an estimator of the form
$$\hat{V}_{AIPW}(d) = n^{-1}\sum_{i=1}^n\left[\frac{C_{d,i}\, Y_i}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)} + \sum_{k=1}^K \left\{\frac{C_{\bar{d}_{k-1},i}}{\bar{\pi}_{d,k-1}(\bar{X}_{k-1,i}; \hat{\bar{\gamma}}_{k-1})} - \frac{C_{\bar{d}_k,i}}{\bar{\pi}_{d,k}(\bar{X}_{ki}; \hat{\bar{\gamma}}_k)}\right\} L_k(\bar{X}_{ki})\right] \qquad (5.34)$$
• $L_k(\bar{x}_k)$ are arbitrary functions of $\bar{x}_k$, k = 1, . . . , K
• $\bar{\gamma}_k = (\gamma_1^T, \ldots, \gamma_k^T)^T$, k = 1, . . . , K
304
90. Augmented inverse probability weighted estimator
Features of $\hat{V}_{AIPW}(d)$:
• Taking $L_k(\bar{x}_k) \equiv 0$, k = 1, . . . , K, yields $\hat{V}_{IPW}(d)$ in (5.32)
• When K = 1, because $C_{\bar{d}_0} \equiv 1$, $\bar{\pi}_{d,0} \equiv 1$, and $X_1 = H_1$, (5.34)
reduces to the AIPW estimator (3.16) for the single decision case
• If all K propensity models are correctly specified, so that there are
true values $\gamma_{k,0}$ of $\gamma_k$, k = 1, . . . , K, the "augmentation term" in
(5.34) evaluated at $\gamma_{k,0}$, k = 1, . . . , K, converges in probability to
zero for arbitrary $L_k(\bar{x}_k)$, k = 1, . . . , K
• Thus, (5.34) is a consistent estimator for V(d), with asymptotic
variance depending on the choice of $L_k(\bar{x}_k)$
305
91. Augmented inverse probability weighted estimator
Efficient estimator: From semiparametric theory, the estimator with
smallest asymptotic variance among those in the class (5.34) takes
$$L_k(\bar{x}_k) = E\{Y^*(d)\,|\,\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\}, \quad k = 1, \ldots, K \qquad (5.35)$$
• The conditional expectations in (5.35) are functionals of the
distribution of the potential outcomes
$\{X_1, X^*_2(d_1), \ldots, X^*_K(\bar{d}_{K-1}), Y^*(d)\}$ and are unknown in practice
• As in the single decision case, posit and fit models
$$Q_{d,k}(\bar{x}_k; \beta_k), \quad k = 1, \ldots, K, \qquad (5.36)$$
for $E\{Y^*(d)\,|\,\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\}$, k = 1, . . . , K, and substitute in
(5.34)
• We present approaches to developing and fitting models (5.36)
when we discuss optimal regimes later
306
92. Augmented inverse probability weighted estimator
Result: Given estimators $\hat{\beta}_k$, k = 1, . . . , K, obtained as we discuss
later, the AIPW estimator is
$$\hat{V}_{AIPW}(d) = n^{-1}\sum_{i=1}^n\left[\frac{C_{d,i}\, Y_i}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)} + \sum_{k=1}^K \left\{\frac{C_{\bar{d}_{k-1},i}}{\bar{\pi}_{d,k-1}(\bar{X}_{k-1,i}; \hat{\bar{\gamma}}_{k-1})} - \frac{C_{\bar{d}_k,i}}{\bar{\pi}_{d,k}(\bar{X}_{ki}; \hat{\bar{\gamma}}_k)}\right\} Q_{d,k}(\bar{X}_{ki}; \hat{\beta}_k)\right] \qquad (5.37)$$
• $\hat{V}_{AIPW}(d)$ in (5.37) is doubly robust; i.e., it is consistent if either (1)
the models for the propensities, and thus for $\pi_{d,1}(x_1)$ and
$\pi_{d,k}(\bar{x}_k)$, k = 2, . . . , K, or (2) the models (5.36) for
$E\{Y^*(d)\,|\,\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\}$ are correctly specified
• If the observed data are from a SMART, the propensities are
known, and $\hat{V}_{AIPW}(d)$ is guaranteed to be consistent regardless
of the models (5.36)
307
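The structure of (5.37) can be sketched numerically. Below, K = 2 with known propensities of 1/2 at each decision (a SMART), and the true conditional expectations under the invented generative scheme stand in for the fitted models Q_{d,k}; all data, models, and names are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of the AIPW estimator (5.37) for K = 2 with known
# propensities of 1/2 (a SMART) and the true conditional expectations
# E{Y*(d) | Xbar_k} standing in for fitted models Q_{d,k}; all simulated.
rng = np.random.default_rng(4)
n = 10000

x1 = rng.normal(size=n)
a1 = rng.binomial(1, 0.5, size=n)
x2 = x1 + a1 + rng.normal(size=n)
a2 = rng.binomial(1, 0.5, size=n)
y = x2 + a2 + rng.normal(size=n)

# Regime d: option 1 at both decisions
c1 = (a1 == 1).astype(float)             # C_{dbar_1}
c2 = c1 * (a2 == 1)                      # C_{dbar_2} = C_d
pibar1, pibar2 = 0.5, 0.25               # cumulative propensities pibar_{d,k}

# True E{Y*(d) | Xbar_k} under this generative scheme, playing Q_{d,k}
q1 = x1 + 2.0                            # Q_{d,1}(x1)
q2 = x2 + 1.0                            # Q_{d,2}(xbar2); used only when c1 = 1

ipw_term = c2 * y / pibar2
aug_term = (1.0 - c1 / pibar1) * q1 + (c1 / pibar1 - c2 / pibar2) * q2
v_aipw = np.mean(ipw_term + aug_term)    # estimator of the form (5.37)
```

Note that the k = 2 augmentation term is nonzero only for individuals with C_{d̄,1} = 1, for whom X̄2 = X̄*2(d1): exactly the "partial dropouts" whose information the augmentation recovers.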
93. Augmented inverse probability weighted estimator
Efficient estimator: If all models are correctly specified, $\hat{V}_{AIPW}(d)$ in
(5.37) is efficient among estimators in the class (5.34), achieving the
smallest asymptotic variance
Remarks:
• $\hat{V}_{AIPW}(d)$ can exhibit considerably less sampling variation than
the simple IPW estimators (5.32) and (5.33)
• As for $\hat{V}_{IPW}(d)$ and $\hat{V}_{IPW*}(d)$, a large sample approximate
sampling distribution can be obtained via M-estimation theory
(although it is pretty involved)
• Zhang et al. (2013) propose (5.37) in a different but equivalent
form following directly from that in Tsiatis (2006)
308
94. Estimation via marginal structural models
Alternative approach: When scientific interest focuses on regimes
with simple rules that can be represented in terms of a
low-dimensional parameter η
• Formally, restrict to a subset $\mathcal{D}_\eta \subset \mathcal{D}$ with elements
$$d_\eta = \{d_1(h_1; \eta_1), \ldots, d_K(h_K; \eta_K)\}, \quad \eta = (\eta_1^T, \ldots, \eta_K^T)^T$$
• Goal: Estimate the value of a fixed regime dη∗ ∈ Dη
corresponding to a particular η∗, V(dη∗ ) = E{Y*(dη∗ )}
• As in the example coming next, Dη may be even simpler, with
η = η1 = · · · = ηK
309
95. Estimation via marginal structural models
Example: Treatment of HIV-infected patients
• K monthly clinic visits; at each, a decision on whether or not to
administer antiretroviral (ARV) therapy for the next month based on
CD4 T-cell count (cells/mm3) (larger is better)
• Two feasible options: 1 = ARV therapy for the next month or 0 =
no ARV therapy for the next month
• Focus on rules involving a common CD4 threshold η below
which ARV therapy is administered and above which it is not
dk (hk ; η) = I(CD4k ≤ η), k = 1, . . . , K
CD4k = CD4 T cell count (cells/mm3) immediately prior to
Decision k
• Dη comprises regimes dη with rules of this form
• Final outcome: Y*(dη) = the negative of viral load (viral RNA
copies/mL) measured 1 month after Decision K if ARV therapy were
administered using the rules in dη, so that larger outcomes are better
310
96. Estimation via marginal structural models
Focus: Estimation of V(dη∗ ) for particular threshold η∗
• Ideal: Data from a study where HIV patients were given ARV
therapy according to dη∗ and viral load was ascertained after
Decision K
• More likely: Available data are observational, with information
Xk, k = 1, . . . , K, including CD4k; the options Ak actually received;
and Y = the negative of final viral load recorded
• In these data, there are likely very few if any individuals who
received ARV therapy according to the rules in dη∗
• An approach to using these data to estimate V(dη∗ ) is suggested
by work of Orellana et al. (2010ab)
311
97. Estimation via marginal structural models
Marginal structural model: A model for V(dη) as a function of η
• I.e., with V(dη) = µ(η) for some function µ( · ), posit a parametric
model
$$V(d_\eta) = E\{Y^*(d_\eta)\} = \mu(\eta; \alpha) \qquad (5.38)$$
referred to as a marginal structural model (MSM)
• For example, a quadratic model
$$\mu(\eta; \alpha) = \alpha_1 + \alpha_2 \eta + \alpha_3 \eta^2, \quad \alpha = (\alpha_1, \alpha_2, \alpha_3)^T$$
• Estimate α based on the data by an appropriate method to obtain
$\hat{\alpha}$, and estimate V(dη∗) by
$$\hat{V}_{MSM}(d_{\eta^*}) = \mu(\eta^*; \hat{\alpha})$$
• η plays the role of "covariate," and $\mu(\eta^*; \hat{\alpha})$ is the "predicted
value" at the particular value η∗ of interest
312
98. Estimation via marginal structural models
Hope: The MSM is a correct specification of the true relationship
µ(η) across a plausible range of thresholds of interest
Estimation of α: Ideally, data from a prospective randomized study
• Each subject i = 1, . . . , n is randomized to one of m
predetermined thresholds η(j), j = 1, . . . , m, in the range of
interest and receives ARV according to dη(j)
• The natural estimator $\hat{\alpha}$ solves in α the estimating equation
$$\sum_{i=1}^n \sum_{j=1}^m C_{d_{\eta^{(j)}},i}\, \frac{\partial \mu(\eta^{(j)}; \alpha)}{\partial \alpha}\, w(\eta^{(j)})\, \{Y_i - \mu(\eta^{(j)}; \alpha)\} = 0 \qquad (5.39)$$
where w(η) is a weight function and $C_{d_\eta} = I\{\bar{A} = \bar{d}_\eta(\bar{X})\}$
• w(η) ≡ 1 yields OLS estimation
• Each subject i has treatment experience consistent with exactly one
of the η(j), j = 1, . . . , m
• (5.39) is an unbiased estimating equation if µ(η; α) is correctly specified
313
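The ideal randomized version with w ≡ 1 is just OLS of Y on a polynomial in the assigned threshold, followed by prediction at η*. The sketch below invents a quadratic true value curve and simulated assignments purely for illustration.

```python
import numpy as np

# Hypothetical sketch of MSM value estimation (5.38)-(5.39) in the ideal
# randomized setting: subjects are randomized across m thresholds
# eta_(j) = 100 + 400(j-1)/(m-1), the quadratic MSM is fit by OLS
# (w = 1), and V(d_{eta*}) is the predicted value at eta*.  The "true"
# value curve generating Y below is invented for illustration.
rng = np.random.default_rng(5)
n, m = 3000, 9
etas = np.linspace(100.0, 500.0, m)      # the thresholds eta_(j)
eta_i = rng.choice(etas, size=n)         # each subject follows one d_eta

# Invented quadratic true value curve plus noise
y = 1.0 + 0.01 * eta_i - 1.5e-5 * eta_i**2 + rng.normal(scale=0.5, size=n)

# OLS fit of mu(eta; alpha) = alpha1 + alpha2*eta + alpha3*eta^2
X = np.column_stack([np.ones(n), eta_i, eta_i**2])
alpha_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

eta_star = 350.0
v_msm = alpha_hat @ np.array([1.0, eta_star, eta_star**2])  # mu(eta*; alpha_hat)
```

With observational data, the same fit would be carried out with each subject's contribution to every threshold it is consistent with, inverse weighted by the estimated propensity of following that d_η, as in the estimating equation on the next slides.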
99. Estimation via marginal structural models
Observational data: The approach is motivated by (5.39)
• Some individuals have treatment experience consistent with ≥ 1
values of η, e.g., with K = 2
Experience Consistent with
CD41=300, A1 = 1, CD42=400, A2 = 0 η ∈ (300, 400)
CD41=300, A1 = 1, CD42=400, A2 = 1 η > 400
CD41=300, A1 = 0, CD42=400, A2 = 0 η < 300
• Others have treatment experience consistent with no value of η
Experience Consistent with
CD41=300, A1 = 0, CD42=400, A2 = 1 no η
314
100. Estimation via marginal structural models
Observational data: The estimator $\hat{\alpha}$ is the solution to
$$\sum_{i=1}^n \int_{\mathcal{D}_\eta} \frac{C_{d_\eta,i}}{\prod_{k=2}^K \pi_{d_\eta,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d_\eta,1}(X_{1i}; \hat{\gamma}_1)} \times \frac{\partial \mu(\eta; \alpha)}{\partial \alpha}\, w(\eta)\, \{Y_i - \mu(\eta; \alpha)\}\, d\nu(d_\eta) = 0$$
• dν(dη) is an appropriate dominating measure on dη ∈ Dη
• w(η) is a weight function
• A given individual i can contribute information on multiple
thresholds or none at all
• As in the IPW estimators, each of her contributions is weighted
by the reciprocal of an estimator for the propensity of receiving
treatment consistent with dη given observed history
315
101. Estimation via marginal structural models
For example: With interest in η ∈ [100, 500], for j = 1, . . . , m, partition
so that $\eta^{(j)} = 100 + 400(j-1)/(m-1)$, and let dν(dη) place point mass
at each $\eta^{(j)}$:
$$\sum_{i=1}^n \sum_{j=1}^m \frac{C_{d_{\eta^{(j)}},i}}{\prod_{k=2}^K \pi_{\eta^{(j)},k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{\eta^{(j)},1}(X_{1i}; \hat{\gamma}_1)} \times \frac{\partial \mu(\eta^{(j)}; \alpha)}{\partial \alpha}\, w(\eta^{(j)})\, \{Y_i - \mu(\eta^{(j)}; \alpha)\} = 0$$
• Under SUTVA, SRA, and positivity, if µ(η; α) and the propensity
models are correctly specified, these estimating equations can
be shown to be unbiased, and α is an M-estimator
• Here, if there is a sufficient number of individuals who received
treatment consistent with at least one η(j), j = 1, . . . , m, $\hat{\alpha}$ should
be a reasonable estimator in practice
• An augmented version of the estimating equation is possible
316
102. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
317
103. References
Orellana, L., Rotnitzky, A., and Robins, J. M. (2010a). Dynamic
regime marginal structural mean models for estimation of optimal
dynamic treatment regimes, part I: Main content. The International
Journal of Biostatistics, 6.
Orellana, L., Rotnitzky, A., and Robins, J. M. (2010b). Dynamic
regime marginal structural mean models for estimation of optimal
dynamic treatment regimes, part II: Proofs and additional results. The
International Journal of Biostatistics, 6.
Robins, J. (1986). A new approach to causal inference in mortality
studies with sustained exposure periods - application to control of the
healthy worker survivor effect. Mathematical Modelling, 7,
1393–1512.
Robins, J. (1987). Addendum to: A new approach to causal inference
in mortality studies with sustained exposure periods - application to
control of the healthy worker survivor effect. Computers and
Mathematics with Applications, 14, 923–945.
318
104. References
Robins, J. M. (2004). Optimal structural nested models for optimal
sequential decisions. In Lin, D. Y. and Heagerty, P., editors,
Proceedings of the Second Seattle Symposium on Biostatistics,
pages 189–326, New York. Springer.
Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2013).
Robust estimation of optimal dynamic treatment regimes for
sequential treatment decisions. Biometrika, 100, 681–694.
Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating
individual treatment rules using outcome weighted learning. Journal
of the American Statistical Association, 107, 1106–1118.
319