• Treatment regimes for multiple decision points (definition, recursive representation)
• Statistical framework (potential outcomes, feasible sets of treatments and classes of regimes, value, data and data sources, identifiability assumptions)
• The g-computation algorithm
• Estimation of the value of a fixed regime (via g-computation, IPW/AIPW estimators, marginal structural models)
2019 PMED Spring Course - Multiple Decision Treatment Regimes: Fundamentals - Marie Davidian, February 13, 2019
1. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
216
2. Recall: Acute Leukemia
• At baseline: Information x1, history h1 = x1 ∈ H1
• Decision 1: Set of options A1 = {C1,C2}; rule 1: d1(h1): H1 → A1
• Between Decisions 1 and 2: Collect additional information x2, including
responder status
• Accrued information/history h2 = (x1, therapy at decision 1, x2) ∈ H2
• Decision 2: Set of options A2 = {M1,M2,S1,S2}; rule 2:
d2(h2): H2 → {M1,M2} (responder), d2(h2): H2 → {S1,S2} (nonresponder)
• Treatment regime : d = {d1(h1), d2(h2)} = (d1, d2)
217
3. K decision treatment regime
In general: K decision points/stages
• Baseline information x1 ∈ X1, intermediate information xk ∈ Xk
between Decisions k − 1 and k, k = 2, . . . , K
• Treatment options Ak at Decision k, elements ak ∈ Ak ,
k = 1, . . . , K
• Accrued information or history
h1 = x1 ∈ H1
hk = (x1, a1, . . . , xk−1, ak−1, xk ) ∈ Hk , k = 2, . . . , K,
• Decision rules d1(h1), d2(h2), . . . , dK (hK ), dk : Hk → Ak
• Treatment regime
d = {d1(h1), . . . , dK (hK )} = (d1, d2, . . . , dK )
218
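The general structure above, a regime as a bundle of K rules dk: Hk → Ak, can be sketched in code. This is a minimal illustration (the helper `make_regime` and the particular rules are invented, not from the slides), echoing the K = 2 acute leukemia setting:

```python
# A minimal sketch: a K-decision treatment regime as a list of decision
# rules, each mapping an accrued history h_k to a treatment option.
# All names below (make_regime, the example rules) are illustrative.

def make_regime(rules):
    """Bundle K decision rules d_1, ..., d_K into a regime d."""
    return list(rules)

# K = 2 example echoing acute leukemia: h1 = x1, and h2 = (x1, a1, x2)
# where x2 is responder status (1 = responder)
d1 = lambda h1: "C1" if h1 >= 0.5 else "C2"   # rule at Decision 1
d2 = lambda h2: "M1" if h2[2] == 1 else "S1"  # rule at Decision 2

d = make_regime([d1, d2])
a1 = d[0](0.7)            # option selected at Decision 1
a2 = d[1]((0.7, a1, 1))   # option selected at Decision 2 for a responder
print(a1, a2)             # C1 M1
```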
4. Notation
Overbar notation: With
xk ∈ Xk , ak ∈ Ak , k = 1, . . . , K
it is convenient to define for k = 1, . . . , K
x̄k = (x1, . . . , xk ), āk = (a1, . . . , ak )
X̄k = X1 × · · · × Xk , Āk = A1 × · · · × Ak
• Conventions: ā = āK , x̄ = x̄K , a0 is null
• h1 = x1, hk = (x̄k , āk−1), k = 2, . . . , K
• H1 = X1, Hk = X̄k × Āk−1, k = 2, . . . , K
• d̄k = (d1, . . . , dk ), k = 1, . . . , K, d = d̄K = (d1, . . . , dK )
219
5. Decision points
Decision points depend on the context: For example
• Acute leukemia: Decisions are at milestones in the disease
progression (diagnosis, evaluation of response)
• HIV infection: Decisions are according to a schedule (at monthly
clinic visits)
• Cardiovascular disease (CVD): Decisions are made upon
occurrence of an event (myocardial infarction)
Perspective:
• Timing of decisions can be fixed (HIV) or random (leukemia)
• If any individual would reach all K decision points, the distinction
is not important; we assume this going forward
• If different individuals can experience different numbers of
decisions (CVD), the number of decisions reached is random,
and a more specialized framework is required
• The latter also is the case if the outcome is a time to an event
220
7. Recursive representation of rules
Fact: If the K rules in d ∈ D are followed by an individual, the options
selected at each decision depend only on the evolving x1, . . . , xK
• Decision 1: Option selected is d1(h1) = d1(x1)
• Between Decisions 1 and 2: x2
• Decision 2: Rule d2(h2) = d2(x̄2, a1), option selected depends
on the option selected at Decision 1
d2{x̄2, d1(x1)}
• Between Decisions 2 and 3: x3
• Decision 3: Rule d3(h3) = d3(x̄3, ā2) = d3(x̄3, a1, a2), option
selected depends on those at Decisions 1 and 2
d3[x̄3, d1(x1), d2{x̄2, d1(x1)}]
• And so on. . .
222
8. Recursive representation of rules
Concise representation: For k = 2, . . . , K
d̄2(x̄2) = [d1(x1), d2{x̄2, d1(x1)}]
d̄3(x̄3) = [d1(x1), d2{x̄2, d1(x1)}, d3{x̄3, d̄2(x̄2)}]
... (5.2)
d̄K (x̄K ) = [d1(x1), d2{x̄2, d1(x1)}, . . . , dK {x̄K , d̄K−1(x̄K−1)}]
• d̄k (x̄k ) comprises the options selected through Decision k
• d(x̄) = d̄K (x̄K )
• This representation will be useful later
223
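The recursive representation (5.2) can be sketched in code: the options a regime selects unfold along with the covariates, each rule seeing the history built from earlier covariates and earlier selections. The function name and example rules are illustrative:

```python
# A sketch of the recursive representation (5.2): given rules d_k(h_k) with
# h_k = (x_1, a_1, ..., x_k), compute the options a regime selects as the
# covariates x_1, ..., x_K unfold. Names here are illustrative.

def options_selected(rules, xs):
    """Return (a_1, ..., a_K) chosen by following the rules along xs."""
    history = []        # accrues (x_1, a_1, x_2, a_2, ...)
    selected = []
    for d_k, x_k in zip(rules, xs):
        history.append(x_k)
        a_k = d_k(tuple(history))   # option depends only on evolving history
        history.append(a_k)
        selected.append(a_k)
    return tuple(selected)

# K = 2: d1 treats if x1 = 1; d2 treats if either x2 = 1 or a1 = 1
d1 = lambda h: int(h[0] == 1)
d2 = lambda h: int(h[2] == 1 or h[1] == 1)
print(options_selected([d1, d2], [1, 0]))  # (1, 1): a2 reflects a1 = d1(x1)
```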
9. Redundancy
From the perspective of an individual following the K rules in d:
• The definition of rules dk (hk ) = dk (x̄k , āk−1) is redundant
• a1 is determined by x1, a2 is determined by x̄2 = (x1, x2), . . .
• But definition of dk as a function of hk = (x̄k , āk−1) is useful for
characterizing and estimating an optimal regime later
Illustration of redundancy: K = 2, A1 = {0, 1}, A2 = {0, 1},
X1 = {0, 1}, X2 = {0, 1} (x1 and x2 are binary)
• d = (d1, d2) ∈ D, d1 : H1 = X1 → A1, d2 : H2 = X2 × A1 → A2
• For each value of h1 = x1, d1 must return a value in A1, e.g.,
d1(x1 = 0) = 0, d1(x1 = 1) = 1
224
10. Redundancy
Illustration of redundancy, continued:
• For each of the 2³ = 8 possible values of h2 = (x1, x2, a1), d2
must return a value in A2, e.g.,
d2(x1 = 0, x2 = 0, a1 = 0) = 0
d2(x1 = 0, x2 = 0, a1 = 1) = 1∗
d2(x1 = 0, x2 = 1, a1 = 0) = 1
d2(x1 = 0, x2 = 1, a1 = 1) = 1∗
d2(x1 = 1, x2 = 0, a1 = 0) = 0∗
d2(x1 = 1, x2 = 0, a1 = 1) = 1
d2(x1 = 1, x2 = 1, a1 = 0) = 0∗
d2(x1 = 1, x2 = 1, a1 = 1) = 0
• Configurations with ∗ could never occur if an individual followed
regime d
225
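The starred configurations above can be found mechanically: under the example rule d1(x1) = x1, a history (x1, x2, a1) can occur only if a1 = d1(x1). A short sketch (illustrative code only):

```python
# A sketch of the redundancy illustration: enumerate all 2^3 = 8 values of
# h2 = (x1, x2, a1) and keep those that could occur under the regime with
# d1(x1) = x1, i.e., d1(0) = 0 and d1(1) = 1. Illustrative code only.
from itertools import product

d1 = lambda x1: x1                      # the example rule at Decision 1
reachable = [(x1, x2, a1)
             for x1, x2, a1 in product([0, 1], repeat=3)
             if a1 == d1(x1)]           # a1 is determined by x1 under d

print(len(reachable))   # 4 of the 8 configurations can occur
```

The other four configurations are exactly the starred rows on the slide.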
11. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
226
12. Outcome of interest
Determination of outcome:
• Can be ascertained after Decision K, e.g., for HIV infected
patients, outcome = viral load (viral RNA copies/mL) measured
at a final clinic visit after Decision K
• Can be defined using intervening information, e.g., for HIV
infected patients with CD4 count (cells/mm³) measured at each
clinic visit (Decisions k = 2, . . . , K) and at a final clinic visit after
Decision K
outcome = total # CD4 counts > 200 cells/mm³
Convention: As for single decision case, assume larger outcomes
are preferred
• E.g., because smaller viral load is better, take
outcome = −viral load
227
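Both outcome constructions above are one-liners; a sketch with invented toy data:

```python
# A sketch of outcomes defined from intervening information: for the HIV
# example, the outcome counts CD4 measurements exceeding 200 cells/mm^3
# across visits; when smaller is better (viral load), negate. Toy data.
cd4 = [150, 220, 310, 180, 450]          # CD4 counts at the visits (invented)
outcome_cd4 = sum(c > 200 for c in cd4)  # total # CD4 counts > 200
viral_load = 1_200.0
outcome_vl = -viral_load                 # larger outcome preferred
print(outcome_cd4, outcome_vl)           # 3 -1200.0
```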
13. Potential outcomes for K decisions
Intuitively: For a randomly chosen individual with history X1
• If he/she were to receive a1 ∈ A1 at Decision 1, the evolution of
his/her disease/disorder process after Decision 1 would be
influenced by a1
• Suggests: X2*(a1) = intervening information that would arise
between Decisions 1 and 2 after receiving a1 ∈ A1 at Decision 1
• If he/she then were to receive a2 ∈ A2 at Decision 2, the
evolution of his/her disease/disorder process after Decision 2
would be influenced by a1 followed by a2
• Suggests: X3*(ā2) = intervening information that would arise
between Decisions 2 and 3 after receiving a1 ∈ A1 at Decision 1
and a2 at Decision 2
• And so on. . .
228
14. Potential outcomes for K decisions
Ultimately: If he/she were to receive options ā = āK = (a1, . . . , aK )
at Decisions 1, . . . , K
Y*(āK ) = Y*(ā) = outcome that would be achieved
Summarizing: Potential information at each decision and potential
outcome if an individual were to receive ā = (a1, . . . , aK )
{X1, X2*(a1), X3*(ā2), . . . , XK*(āK−1), Y*(ā)}
• X1 may or may not be included
• Set of all possible potential outcomes for all ā ∈ Ā
W* = {X2*(a1), X3*(ā2), . . . , XK*(āK−1), Y*(ā),
for a1 ∈ A1, ā2 ∈ Ā2, . . . , āK−1 ∈ ĀK−1, ā ∈ Ā} (5.3)
229
15. Feasible sets and classes of treatment regimes
Example: Acute leukemia, K = 2
A1 = {C1,C2}, A2 = {M1, M2, S1, S2}
• At Decision 1, both options are feasible for all patients
• At Decision 2, only maintenance options are feasible for patients
who respond, and only salvage options are feasible for patients
who do not respond
• This is almost always the case in multiple decision problems at
decision points other than Decision 1
• Of course, it is also possible at Decision 1
230
16. Feasible sets and classes of treatment regimes
Formal specification: For any history hk = (x̄k , āk−1) ∈ Hk at
Decision k, k = 1, . . . , K
• The set of feasible treatment options at Decision k is
Ψk (hk ) = Ψk (x̄k , āk−1) ⊆ Ak , k = 1, . . . , K (5.4)
Ψ1(h1) = Ψ1(x1) ⊆ A1 (a0 null)
• Ψk (hk ) is a function mapping Hk to the set of all possible
subsets of Ak
• Ψk (hk ) can be a strict subset of Ak or all of Ak , depending on hk
• Collectively: Ψ = (Ψ1, . . . , ΨK )
231
17. Feasible sets and classes of treatment regimes
Example, revisited: Acute leukemia, K = 2
A1 = {C1,C2}, A2 = {M1, M2, S1, S2}
• Suppose C1 and C2 are feasible for all individuals regardless of
h1
Ψ1(h1) = A1 for all h1
• If h2 indicates response
Ψ2(h2) = {M1, M2} ⊂ A2 for all such h2
• If h2 indicates nonresponse
Ψ2(h2) = {S1, S2} ⊂ A2 for all such h2
• More complex specifications of feasible sets that take account of
additional information are of course possible
232
18. Feasible sets and classes of treatment regimes
Fancier example: Acute leukemia, K = 2
A1 = {C1,C2}, A2 = {M1, M2, S1, S2}
• If C1 is contraindicated for patients with renal impairment
Ψ1(h1) = {C2} if h1 indicates renal impairment
= {C1,C2} = A1 if h1 does not
• If S1 increases risk of adverse events in nonresponders with low
WBC count WBC2
Ψ2(h2) = {S2} if h2 indicates nonresponse, WBC2 ≤ w
= {S1, S2} if h2 indicates nonresponse, WBC2 > w
for some threshold w (×10³/µL)
233
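The two leukemia examples above can be sketched as feasible-set functions Ψk(hk); here histories are dicts and the field names ("renal_impairment", "responder") are illustrative assumptions:

```python
# A minimal sketch of feasible-set functions Psi_k(h_k): induction options
# restricted by renal impairment at Decision 1, and maintenance vs. salvage
# options by responder status at Decision 2. Field names are illustrative.

def psi1(h1):
    """Feasible induction options given baseline history h1 (a dict)."""
    return {"C2"} if h1["renal_impairment"] else {"C1", "C2"}

def psi2(h2):
    """Feasible second-stage options given accrued history h2 (a dict)."""
    return {"M1", "M2"} if h2["responder"] else {"S1", "S2"}

# A Psi-specific regime's rule at Decision k must select from psi_k(h_k)
print(psi1({"renal_impairment": True}))   # {'C2'}
print(psi2({"responder": False}))         # the salvage options
```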
19. Feasible sets and classes of treatment regimes
Ideally:
• Specification of the feasible sets is dictated by scientific
considerations
• Disease/disorder context, available treatment options, patient
population, etc
• Specification of Ψk (hk ), k = 1, . . . , K, should incorporate only
information in hk that is critical to treatment selection
Regimes and feasible sets: Given specified feasible sets Ψ
• A regime d = (d1, . . . , dK ) whose rules select treatment options
for history hk from those in Ψk (hk ) is defined in terms of Ψ
• Thus, regimes are Ψ-specific, and the relevant class of all
possible regimes D depends on Ψ (suppressed in the notation)
234
20. Feasible sets and classes of treatment regimes
In practice: At Decision k, there is a small number mk of subsets
Ak,l ⊆ Ak , l = 1, . . . , mk , that are feasible sets for all hk
• E.g., for acute leukemia, m2 = 2
A2,1 = {M1, M2}, A2,2 = {S1, S2}
• If r2 is the component of h2 indicating response
Ψ2(h2) = A2,1 for h2 with r2 = 1
= A2,2 for h2 with r2 = 0
Decision rules: Different rule for each subset
• E.g., for acute leukemia, as in (5.1)
d2(h2) = I(r2 = 1) d2,1(h2) + I(r2 = 0) d2,2(h2)
d2,1(h2) is a rule selecting maintenance therapy for responders
d2,2(h2) is a rule selecting salvage therapy for nonresponders
235
21. Feasible sets and classes of treatment regimes
In general: With mk distinct subsets Ak,l ⊆ Ak , l = 1, . . . , mk , as
feasible sets
• Define sk (hk ) ∈ {1, . . . , mk } according to which of these subsets
Ψk (hk ) corresponds to for given hk
• dk,l(hk ) is the rule corresponding to the lth subset Ak,l
• Decision rule at Decision k has form
dk (hk ) = Σ_{l=1}^{mk} I{sk (hk ) = l} dk,l(hk ) (5.5)
• Henceforth, it is understood that dk (hk ) may be expressed as in
(5.5) where appropriate
236
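The composite rule (5.5) can be sketched as a dispatcher: sk(hk) indicates the feasible subset, and only the corresponding subset-specific rule is applied. Function names and the toy rules are illustrative:

```python
# A sketch of the composite rule (5.5): with subsets A_{k,l}, the rule at
# Decision k dispatches to a subset-specific rule d_{k,l} according to
# s_k(h_k). Names and the toy rules below are illustrative.

def make_dk(s_k, subrules):
    """Build d_k(h_k) = sum_l I{s_k(h_k) = l} d_{k,l}(h_k)."""
    def d_k(h_k):
        return subrules[s_k(h_k)](h_k)   # only the indicated term is nonzero
    return d_k

# Acute leukemia, Decision 2: s_2 indexes {M1, M2} for responders (l = 1)
# and {S1, S2} for nonresponders (l = 2); h2 = (x1, a1, responder, wbc)
s2 = lambda h2: 1 if h2[2] == 1 else 2
d21 = lambda h2: "M1" if h2[3] > 3.0 else "M2"   # maintenance rule
d22 = lambda h2: "S1" if h2[3] > 3.0 else "S2"   # salvage rule

d2 = make_dk(s2, {1: d21, 2: d22})
print(d2((0, "C1", 1, 5.0)))   # M1
print(d2((0, "C2", 0, 2.0)))   # S2
```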
22. Feasible sets and classes of treatment regimes
Note: For any Ψ-specific regime d = (d1, . . . , dK )
• At Decision k, dk (hk ) = dk (x̄k , āk−1) returns only options in
Ψk (hk ) = Ψk (x̄k , āk−1), i.e.,
dk (hk ) = dk (x̄k , āk−1) ∈ Ψk (hk ) ⊆ Ak
• Thus, dk need map only a subset of Hk = X̄k × Āk−1 to Ak
• We discuss this more shortly
237
23. Potential outcomes for a fixed regime d ∈ D
Intuitively: If a randomly chosen individual with history X1 were to
receive treatment options by following the rules in d
• Decision 1: Treatment determined by d1
• X2*(d1) = intervening information that would arise between
Decisions 1 and 2
• Decision 2: Treatment determined by d2
• X3*(d̄2) = intervening information that would arise between
Decisions 2 and 3
...
• Xk*(d̄k−1) = intervening information that would arise between
Decisions k − 1 and k, k = 2, . . . , K
• Y*(d) = Y*(d̄K ) = outcome that would be achieved if all rules in
d were followed
Potential outcomes under regime d:
{X1, X2*(d1), X3*(d̄2), . . . , XK*(d̄K−1), Y*(d)} (5.6)
238
24. Potential outcomes for a fixed regime d ∈ D
Formally: These potential outcomes are functions of W* in (5.3)
• Define
X̄k*(āk−1) = {X1, X2*(a1), X3*(ā2), . . . , Xk*(āk−1)}, k = 2, . . . , K
• Then
X2*(d1) = Σ_{a1∈A1} X2*(a1) I{d1(X1) = a1}
Xk*(d̄k−1) = Σ_{āk−1∈Āk−1} Xk*(āk−1) Π_{j=1}^{k−1} I[ dj {X̄j*(āj−1), āj−1} = aj ] (5.7)
k = 3, . . . , K
Y*(d) = Σ_{ā∈Ā} Y*(ā) Π_{j=1}^{K} I[ dj {X̄j*(āj−1), āj−1} = aj ]
• Also define
X̄k*(d̄k−1) = {X1, X2*(d1), X3*(d̄2), . . . , Xk*(d̄k−1)}, k = 2, . . . , K
239
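The sum-of-indicators construction in (5.7) can be sketched for K = 2 with binary options: given an individual's full table of potential outcomes (values invented for illustration), Y*(d) picks out the single outcome consistent with following d:

```python
# A sketch of (5.7) for K = 2 with binary options: Y*(d) as a sum over
# (a1, a2) of Y*(a1, a2) times indicators that d selects those options.
# The potential-outcome values below are invented for illustration.
from itertools import product

x1 = 1
x2_star = {0: 0, 1: 1}                  # X2*(a1) for a1 = 0, 1 (assumed)
y_star = {(0, 0): 5.0, (0, 1): 6.0,     # Y*(a1, a2) (assumed values)
          (1, 0): 4.0, (1, 1): 8.0}

d1 = lambda x1: x1                       # treat at Decision 1 if x1 = 1
d2 = lambda x2, a1: max(x2, a1)          # rule at Decision 2

# Y*(d) = sum over (a1,a2) of Y*(a1,a2) I{d1(X1)=a1} I{d2(X2*(a1),a1)=a2}
y_d = sum(y_star[a1, a2]
          * (d1(x1) == a1)
          * (d2(x2_star[a1], a1) == a2)
          for a1, a2 in product([0, 1], repeat=2))
print(y_d)   # 8.0: d selects a1 = 1, then a2 = 1 since X2*(1) = 1
```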
25. Value of a K-decision regime
With these definitions: The value of d ∈ D is
V(d) = E{Y*(d)}
• And an optimal regime dopt ∈ D satisfies
E{Y*(dopt)} ≥ E{Y*(d)} for all d ∈ D (5.8)
equivalently
dopt = arg max_{d∈D} E{Y*(d)} = arg max_{d∈D} V(d)
240
26. Optimal treatment options vs. optimal decisions
For a randomly chosen individual with history H1:
• The optimal sequence of treatment options for this individual is
arg max_{ā∈Ā} Y*(ā)
which of course is not knowable in practice
• All that is known at baseline is H1, and d1opt, . . . , dKopt select the
(feasible) options corresponding to the largest expected outcome
• From the definition of Y*(d) in (5.7),
Y*(d) ≤ max_{ā∈Ā} Y*(ā) for all d ∈ D =⇒ Y*(dopt) ≤ max_{ā∈Ā} Y*(ā)
so an optimal regime might not select optimal options for this
individual at each decision
• Rather, dopt leads to the optimal sequence of decisions that can
be made based on the available information on this individual
241
27. Data
Ideally: For i = 1, . . . , n, i.i.d.
(X1i, A1i, X2i, A2i, . . . , XKi, AKi, Yi) = (X̄Ki, ĀKi, Yi) = (X̄i, Āi, Yi) (5.9)
• X1 = baseline information at Decision 1, taking values in X1
• Ak = treatment option actually received at Decision k,
k = 1, . . . , K, taking values in Ak
• Xk = intervening information between Decisions k − 1 and k,
k = 2, . . . , K, taking values in Xk
• X̄k = (X1, . . . , Xk ), X̄ = X̄K = (X1, . . . , XK ), and
Āk = (A1, . . . , Ak ), Ā = ĀK = (A1, . . . , AK )
• History H1 = X1, Hk = (X1, A1, . . . , Xk−1, Ak−1, Xk ) = (X̄k , Āk−1),
k = 2, . . . , K
• Y = observed outcome (after Decision K or function of HK )
242
28. Data
Data sources:
• Longitudinal observational study: Retrospective from an existing
database, completed conventional clinical trial with followup,
prospective cohort study
• Randomized study: Prospective clinical trial conducted
specifically for this purpose (SMART)
Longitudinal observational study: Challenges
• Time-dependent confounding: A subject’s history at each
decision point is both determined by past treatments and used to
select future treatments
• Characteristics associated with both treatment selection and
future characteristics/ultimate outcome may not be captured in
the data
243
30. Data
Do we really need data like these? Can't we just "piece together"
an optimal regime using single decision methods on data from
separate studies (with different subjects in each)?
• E.g., acute leukemia: Estimate d1opt from a study comparing
{C1, C2} and d2opt from separate studies comparing {M1, M2}
and {S1, S2}
• Delayed effects: The induction therapy with the highest
proportion of responders might have other effects that render
subsequent treatments less effective in regard to survival
• Require data from a study involving the same subjects over the
entire sequence of decisions
245
31. Statistical problem
Ultimate goal: Based on the data (5.9), estimate dopt ∈ D satisfying
(5.8), i.e.,
E{Y*(dopt)} ≥ E{Y*(d)} for all d ∈ D
Challenge: dopt is defined in terms of the potential outcomes (5.6)
• Must be able to express this definition in terms of the observed
data (5.9)
• In particular, for any d ∈ D, must be able to identify the
distribution of
{X1, X2*(d1), X3*(d̄2), . . . , XK*(d̄K−1), Y*(d)}
which depends on that of (X1, W*), from the distribution of
(X1, A1, X2, A2, . . . , XK , AK , Y)
• Possible under the following assumptions generalizing those in
(3.4), (3.5), and (3.6)
246
32. Identifiability assumptions
SUTVA (consistency):
Xk = Xk*(Āk−1) = Σ_{āk−1∈Āk−1} Xk*(āk−1) I(Āk−1 = āk−1), k = 2, . . . , K (5.10)
Y = Y*(Ā) = Σ_{ā∈Ā} Y*(ā) I(Ā = ā)
• Observed intervening information and the final observed
outcome are those that would potentially be seen under the
treatments actually received at each decision point
• Are the same (consistent) regardless of how the treatments are
administered at each decision point
247
33. Identifiability assumptions
Sequential randomization assumption (SRA): Robins (1986)
W* ⊥⊥ Ak | X̄k , Āk−1, k = 1, . . . , K, where A0 is null (5.11)
equivalently
W* ⊥⊥ Ak | Hk , k = 1, . . . , K
• (5.11) is the strongest version; weaker versions require only that
treatment selection be independent of future potential outcomes
• Unverifiable from the observed data
• Unlikely to hold for data from a longitudinal observational study
not carried out with estimation of treatment regimes in mind
• But (5.11) holds by design in a SMART
248
34. Identifiability assumptions
Positivity assumption: With feasible sets of treatment options at
each decision point, more complicated
• Intuitively: To identify the distribution of the potential outcomes
(5.6) from that of the observed data (5.9), all treatment options in
the feasible sets Ψk (hk ) = Ψk (x̄k , āk−1) must be represented in
the observed data, k = 1, . . . , K
• That is, there must be individuals in the data who received each
of the options in Ψk (hk ), k = 1, . . . , K
• E.g., acute leukemia: at Decision 2, there must be responders who
received each of M1 and M2 and nonresponders who received
each of S1 and S2
Built up recursively. . .
249
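The intuition above lends itself to a simple empirical check: for each history pattern at Decision 2, verify that every feasible option was actually received by someone. A sketch with invented toy data and field names:

```python
# A sketch of an empirical check of positivity at Decision 2: for each
# responder status, verify every feasible option appears in the data.
# The toy data and the encoding (r2, a2) are invented for illustration.
from collections import defaultdict

# (responder status r2, option a2 received) for a toy sample
data = [(1, "M1"), (1, "M2"), (1, "M1"), (0, "S1"), (0, "S1"), (0, "S2")]
feasible = {1: {"M1", "M2"}, 0: {"S1", "S2"}}   # Psi_2 by responder status

seen = defaultdict(set)
for r2, a2 in data:
    seen[r2].add(a2)

violations = {r2: feasible[r2] - seen[r2] for r2 in feasible}
print(violations)   # empty sets: all feasible options represented
```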
35. Identifiability assumptions
Decision 1: Set of all possible baseline info h1 = x1
Γ1 = {x1 ∈ X1 satisfying P(X1 = x1) > 0} ⊆ X1 = H1
Set of all possible histories and associated options in Ψ1(h1)
Λ1 = {(x1, a1) such that x1 = h1 ∈ Γ1, a1 ∈ Ψ1(h1)}
All options in Λ1 must be represented in the data
P(A1 = a1|H1 = h1) > 0 for all (h1, a1) ∈ Λ1 (5.12)
First component of the positivity assumption
250
36. Identifiability assumptions
Decision 2: All possible histories h2 consistent with following a
Ψ-specific regime at Decision 1
Γ2 = {(x̄2, a1) ∈ X̄2 × A1 satisfying (x1, a1) ∈ Λ1 and
P{X2*(a1) = x2 | X1 = x1} > 0} ⊆ H2
Under SUTVA (5.10) and SRA (5.11), equivalently
Γ2 = {(x̄2, a1) ∈ X̄2 × A1 satisfying (x1, a1) ∈ Λ1 and
P(X2 = x2 | X1 = x1, A1 = a1) > 0}
Set of all possible histories h2 and associated options in Ψ2(h2)
Λ2 = {(x̄2, ā2) such that (x̄2, a1) = h2 ∈ Γ2, a2 ∈ Ψ2(h2)}
251
37. Identifiability assumptions
Decision 2, continued: All options in Λ2 must be represented in the
data
P(A2 = a2 | H2 = h2) > 0 for all (h2, a2) ∈ Λ2
Second component of the positivity assumption
...
Decision k: All histories hk consistent with a Ψ-specific regime
through Decision k − 1
Γk = {(x̄k , āk−1) ∈ X̄k × Āk−1 satisfying (x̄k−1, āk−1) ∈ Λk−1 and
P{Xk*(āk−1) = xk | X̄k−1*(āk−2) = x̄k−1} > 0} ⊆ Hk (5.13)
= {(x̄k , āk−1) ∈ X̄k × Āk−1 satisfying (x̄k−1, āk−1) ∈ Λk−1 and
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0} (5.14)
252
38. Identifiability assumptions
Decision k, continued: Set of all possible histories hk and
associated options in Ψk (hk )
Λk = {(x̄k , āk ) such that (x̄k , āk−1) = hk ∈ Γk , ak ∈ Ψk (hk )}
All options in Ψk (hk ) must be represented in the data
P(Ak = ak | Hk = hk ) > 0 for all (hk , ak ) ∈ Λk
kth component of the positivity assumption
...
253
39. Identifiability assumptions
Positivity assumption: Summarizing
P(Ak = ak | Hk = hk ) = P(Ak = ak | X̄k = x̄k , Āk−1 = āk−1) > 0
for hk = (x̄k , āk−1) ∈ Γk and ak ∈ Ψk (hk ) = Ψk (x̄k , āk−1),
k = 1, . . . , K (5.15)
• The positivity assumption (5.15) holds in a SMART by design if
there are subjects with history hk randomized to all options in
Ψk (hk ) at each Decision k = 1, . . . , K
• No guarantee that for a given Ψ (5.15) holds for data from a
longitudinal observational study (more shortly)
254
40. Identifiability assumptions
Equivalence of (5.13) and (5.14): Assuming SUTVA, SRA, need to
show for any hk = (x̄k , āk−1) ∈ Γk in (5.13)
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1)
= P{Xk*(āk−1) = xk | X̄k−1*(āk−2) = x̄k−1} (5.16)
• Proof is by induction (k = 1, 2 are immediate)
• Repeated use of the following lemma
Lemma. Let A and H be random variables, assume
W* ⊥⊥ A | H, and consider two functions f1(W*) and f2(W*) of
W*. If the event {f2(W*) = f2, H = h, A = a} has positive
probability, then
P{f1(W*) = f1 | f2(W*) = f2, H = h, A = a}
= P{f1(W*) = f1 | f2(W*) = f2, H = h}
255
41. Identifiability assumptions
Sketch of induction proof:
• We need to show for any hk = (x̄k , āk−1) ∈ Γk in (5.13), k = 1, . . . , K
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1)
= P{Xk*(āk−1) = xk | X̄k−1*(āk−2) = x̄k−1} (5.16)
• (5.16) is trivial for k = 1
• k = 2: (5.12) implies P(X1 = x1, A1 = a1) > 0 for (x1, a1) ∈ Λ1, so
P(X2 = x2 | X1 = x1, A1 = a1) is well defined. Then by SUTVA and
SRA, (5.16) holds:
P(X2 = x2 | X1 = x1, A1 = a1) = P{X2*(a1) = x2 | X1 = x1, A1 = a1}
= P{X2*(a1) = x2 | X1 = x1}
• For general k: Need to show P(X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0, so
that P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1) is well defined, and then
show (5.16)
256
42. Identifiability assumptions
• Assume this holds for k − 1 and then show it holds for k. Thus, for
hk−1 = (x̄k−1, āk−2) ∈ Γk−1, P(X̄k−2 = x̄k−2, Āk−2 = āk−2) > 0 and
P(Xk−1 = xk−1 | X̄k−2 = x̄k−2, Āk−2 = āk−2)
= P{Xk−1*(āk−2) = xk−1 | X̄k−2*(āk−3) = x̄k−2} > 0
• It follows that P(X̄k−1 = x̄k−1, Āk−2 = āk−2) > 0
• Because hk−1 = (x̄k−1, āk−2) ∈ Γk−1 and ak−1 ∈ Ψk−1(x̄k−1, āk−2),
P(Ak−1 = ak−1 | X̄k−1 = x̄k−1, Āk−2 = āk−2) > 0
• It then follows that P(X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0 as required
• Now show (5.16) using SUTVA, SRA, and the Lemma
257
43. Identifiability assumptions
• By repeated use of SUTVA and SRA
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1)
= P{Xk*(āk−1) = xk | X̄k−2 = x̄k−2, Xk−1*(āk−2) = xk−1,
Āk−3 = āk−3, Ak−2 = ak−2} (5.17)
• By the Lemma with f1(W*) = Xk*(āk−1), A = Ak−2, H = (X̄k−2, Āk−3),
and f2(W*) = Xk−1*(āk−2), (5.17) =
P{Xk*(āk−1) = xk | X̄k−2 = x̄k−2, Xk−1*(āk−2) = xk−1, Āk−3 = āk−3}
= P{Xk*(āk−1) = xk | X̄k−3 = x̄k−3, Xk−2*(āk−3) = xk−2,
Xk−1*(āk−2) = xk−1, Āk−4 = āk−4, Ak−3 = ak−3} (5.18)
• By the Lemma with f1(W*) = Xk*(āk−1), A = Ak−3, H = (X̄k−3, Āk−4),
(5.18) =
P{Xk*(āk−1) = xk | X̄k−3 = x̄k−3, Xk−2*(āk−3) = xk−2,
Xk−1*(āk−2) = xk−1, Āk−4 = āk−4}
258
44. Identifiability assumptions
• Continuing to apply the Lemma and SUTVA leads to (5.16),
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1)
= P{Xk*(āk−1) = xk | X̄k−1*(āk−2) = x̄k−1}
• Because hk = (x̄k , āk−1) ∈ Γk , the RHS is > 0
• Thus, we have shown that (5.16) holds for k and
P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0
for hk = (x̄k , āk−1) ∈ Γk
• Because this holds for k = 1, 2, the result follows by induction
259
45. Identifiability assumptions
More precise definition of a regime: A Ψ-specific regime
d = (d1, . . . , dK ) satisfies
• Each rule dk , k = 1, . . . , K, is a mapping from Γk ⊆ Hk into Ak
for which dk (hk ) ∈ Ψk (hk ) for every hk ∈ Γk
• The class D of Ψ-specific regimes is the set of all such d
260
46. Identifiability assumptions
Observational data: For given Ψ, no guarantee that all options in
Ψk (hk ), k = 1, . . . , K, are represented in the data
• Define for k = 1, . . . , K
Γk^max = {(x̄k , āk−1) ∈ X̄k × Āk−1 satisfying (x̄k−1, āk−1) ∈ Λk−1^max
and P(Xk = xk | X̄k−1 = x̄k−1, Āk−1 = āk−1) > 0}
Ψk^max(hk ) = {ak ∈ Ak satisfying P(Ak = ak | Hk = hk ) > 0},
hk = (x̄k , āk−1) ∈ Γk^max
Λk^max = {(x̄k , āk ) such that (x̄k , āk−1) = hk ∈ Γk^max, ak ∈ Ψk^max(hk )}
• The class of regimes based on Ψ^max = (Ψ1^max, . . . , ΨK^max) is the
largest that can be considered
• So must have Ψk (hk ) ⊆ Ψk^max(hk ), k = 1, . . . , K, for all
hk ∈ Γk ⊆ Γk^max
261
47. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
262
48. Identifiability result
Goal, again: For any Ψ-specific regime d ∈ D, demonstrate that we
can identify the distribution of
{X1, X2*(d1), X3*(d̄2), . . . , XK*(d̄K−1), Y*(d)}
which depends on that of (X1, W*), from the distribution of
(X1, A1, X2, A2, . . . , XK , AK , Y)
Recall: Recursive representation d̄k (x̄k ) of the treatment options
selected by d through Decision k in (5.2) if an individual follows d,
k = 2, . . . , K
d̄k (x̄k ) = [d1(x1), d2{x̄2, d1(x1)}, . . . , dk {x̄k , d̄k−1(x̄k−1)}]
263
49. g-Computation algorithm
Main result: Under SUTVA (5.10), SRA (5.11), and the positivity
assumption (5.15), the joint density of the potential outcomes
{X1, X2*(d1), X3*(d̄2), . . . , XK*(d̄K−1), Y*(d)} can be obtained as
p_{X1,X2*(d1),...,XK*(d̄K−1),Y*(d)}(x1, . . . , xK , y) (5.19)
= p_{Y|X̄,Ā}{y | x̄, d̄(x̄)}
× p_{XK|X̄K−1,ĀK−1}{xK | x̄K−1, d̄K−1(x̄K−1)}
... (5.20)
× p_{X2|X1,A1}{x2 | x1, d1(x1)}
× p_{X1}(x1)
for any realization (x1, x2, . . . , xK , y) for which (5.19) is positive
• Due to Robins (1986, 1987, 2004)
264
50. g-Computation algorithm
Additional definition: Relevant realizations (x1, x2, . . . , xK , y) are
determined by feasible sets
• Assume Y*(ā) and Y take values y ∈ Y
• Define
ΓK+1 = {(x̄, ā, y) ∈ X̄ × Ā × Y satisfying (x̄, ā) ∈ ΛK and
P{Y*(ā) = y | X̄K*(āK−1) = x̄K } > 0}
= {(x̄, ā, y) ∈ X̄ × Ā × Y satisfying (x̄, ā) ∈ ΛK and
P(Y = y | X̄ = x̄, Ā = ā) > 0}
• This equality can be shown similarly to that of (5.13) and (5.14)
265
51. g-Computation algorithm
Simplification: Take all random variables discrete, so that (5.19) and
(5.20) become
P{X1 = x1, X2*(d1) = x2, . . . , XK*(d̄K−1) = xK , Y*(d) = y} (5.21)
= P{Y = y | X̄K = x̄K , ĀK = d̄K (x̄K )}
× P{XK = xK | X̄K−1 = x̄K−1, ĀK−1 = d̄K−1(x̄K−1)}
... (5.22)
× P{X2 = x2 | X1 = x1, A1 = d1(x1)}
× P(X1 = x1)
for any realization (x1, x2, . . . , xK , y) such that
P{X1 = x1, X2*(d1) = x2, . . . , XK*(d̄K−1) = xK , Y*(d) = y} > 0
• Need to show (5.21) = (5.22)
266
52. g-Computation algorithm
Demonstration: Factorize (5.21) as
P{X1 = x1, X2*(d1) = x2, . . . , XK*(d̄K−1) = xK , Y*(d) = y}
= P{Y*(d) = y | X̄K*(d̄K−1) = x̄K }
× P{XK*(d̄K−1) = xK | X̄K−1*(d̄K−2) = x̄K−1}
...
× P{X2*(d1) = x2 | X1 = x1}
× P(X1 = x1)
• All components on the RHS are positive because (5.21) > 0
267
53. g-Computation algorithm
• From (5.22), it suffices to show
P{Y*(d) = y | X̄K*(d̄K−1) = x̄K }
= P{Y = y | X̄K = x̄K , ĀK = d̄K (x̄K )}
where P{Y = y | X̄K = x̄K , ĀK = d̄K (x̄K )} > 0
and for k = 2, . . . , K
P{Xk*(d̄k−1) = xk | X̄k−1*(d̄k−2) = x̄k−1}
= P{Xk = xk | X̄k−1 = x̄k−1, Āk−1 = d̄k−1(x̄k−1)}
where P{Xk = xk | X̄k−1 = x̄k−1, Āk−1 = d̄k−1(x̄k−1)} > 0
• These follow immediately if, for XK+1*(d̄K ) = Y*(d),
{x̄k , d̄k−1(x̄k−1)} ∈ Γk , k = 2, . . . , K + 1 (5.23)
which can be shown by induction (first show for k = 2)
268
54. g-Computation algorithm
Sketch of induction proof: First take k = 2
• Because P(X1 = x1) > 0, x1 ∈ Γ1 and d is a Ψ-specific regime,
d1(x1) ∈ Ψ1(h1), so that {x1, d1(x1)} ∈ Λ1
• Because (5.21) > 0,
P{X2*(d1) = x2 | X1 = x1} > 0 =⇒ {x̄2, d1(x1)} ∈ Γ2
which is (5.23) for k = 2
• Now assume {x̄k−1, d̄k−2(x̄k−2)} ∈ Γk−1 is true
• Because d is a Ψ-specific regime, dk−1(hk−1) ∈ Ψk−1(hk−1) and thus
{x̄k−1, d̄k−1(x̄k−1)} ∈ Λk−1
• Because (5.21) > 0,
P{Xk*(d̄k−1) = xk | X̄k−1*(d̄k−2) = x̄k−1} > 0 =⇒ {x̄k , d̄k−1(x̄k−1)} ∈ Γk
completing the induction proof
269
55. g-Computation algorithm
General result: Compactly stated
p_{X1,X2*(d1),...,XK*(d̄K−1),Y*(d)}(x1, . . . , xK , y) (5.24)
= p_{Y|X̄,Ā}{y | x̄, d̄K (x̄)} Π_{k=2}^{K} p_{Xk|X̄k−1,Āk−1}{xk | x̄k−1, d̄k−1(x̄k−1)} p_{X1}(x1)
which implies, for example,
p_{Y*(d)}(y) = ∫_X̄ p_{Y|X̄,Ā}{y | x̄, d̄(x̄)} (5.25)
× Π_{k=2}^{K} p_{Xk|X̄k−1,Āk−1}{xk | x̄k−1, d̄k−1(x̄k−1)} p_{X1}(x1) dνK (xK ) · · · dν1(x1)
• dνK (xK ) · · · dν1(x1) is the dominating measure
270
56. g-Computation algorithm
Or the value
E{Y*(d)} = ∫_X̄ E{Y | X̄ = x̄, Ā = d̄(x̄)} (5.26)
× Π_{k=2}^{K} p_{Xk|X̄k−1,Āk−1}{xk | x̄k−1, d̄k−1(x̄k−1)} p_{X1}(x1) dνK (xK ) · · · dν1(x1)
• Thus, the value V(d) = E{Y*(d)} of a regime d ∈ D can be expressed
in terms of the observed data
• So it should be possible to estimate V(d) from these data
• As well as to estimate V(dopt) (later). . .
271
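When everything is discrete, the integral in (5.26) is a finite sum. A sketch for K = 2 with binary x1, x2, a1, a2, using an invented generative model (all probabilities and the outcome model are assumptions for illustration, not from the slides):

```python
# A sketch of (5.26) for K = 2 binary stages: the value of d computed by
# exact summation over the sample space. The model below is invented.
from itertools import product

p_x1 = {0: 0.5, 1: 0.5}                  # P(X1 = x1)

def p_x2(x2, x1, a1):
    """P(X2 = x2 | X1 = x1, A1 = a1); treatment raises P(X2 = 1)."""
    p1 = 0.3 + 0.4 * a1
    return p1 if x2 == 1 else 1.0 - p1

def mean_y(x1, x2, a1, a2):
    """E(Y | x1, x2, a1, a2): outcome model, values invented."""
    return 1.0 + x1 + 2.0 * x2 + a1 + 3.0 * a2 * x2

d1 = lambda x1: 1                   # always treat at Decision 1
d2 = lambda x1, x2, a1: x2          # treat at Decision 2 iff x2 = 1

value = sum(p_x1[x1] * p_x2(x2, x1, d1(x1))
            * mean_y(x1, x2, d1(x1), d2(x1, x2, d1(x1)))
            for x1, x2 in product([0, 1], repeat=2))
print(round(value, 3))   # 6.0 for this toy model
```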
57. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
272
58. Fixed regime d ∈ D
Of interest: Estimation of the value V(d) = E{Y*(d)} of a given, or
fixed, Ψ-specific regime d ∈ D
• In its own right
• As a stepping stone to estimation of an optimal regime dopt
• We consider several methods
• Throughout, take SUTVA (5.10), SRA (5.11), and positivity
assumption (5.15) to hold
273
59. Estimation via g-computation
In principle: From (5.25) and (5.26), can estimate p_{Y*(d)}(y) or
E{Y*(d)} for fixed d ∈ D
• Posit parametric models, e.g., for (5.25)
p_{Y|X̄,Ā}(y | x̄, ā; ζK+1)
p_{Xk|X̄k−1,Āk−1}(xk | x̄k−1, āk−1; ζk ), k = 2, . . . , K
p_{X1}(x1; ζ1)
depending on ζ = (ζ1ᵀ, . . . , ζK+1ᵀ)ᵀ (or for (5.26) a model for
E(Y | X̄ = x̄, Ā = ā) instead)
• Estimate ζ by maximizing the partial likelihood
Π_{i=1}^{n} p_{Y|X̄,Ā}(Yi | X̄i , Āi ; ζK+1) Π_{k=2}^{K} p_{Xk|X̄k−1,Āk−1}(Xki | X̄k−1,i , Āk−1,i ; ζk ) p_{X1}(X1i ; ζ1)
in ζ to obtain ζ̂ = (ζ̂1ᵀ, . . . , ζ̂K+1ᵀ)ᵀ
274
60. Estimation via g-computation
In principle:
• Substitute the fitted models in (5.25) or (5.26)
Major obstacle: (5.25) and (5.26) involve integration over the sample
space X̄ = X1 × · · · × XK
• Except in the simplest situations, e.g., all x1, . . . , xK discrete and
low-dimensional, the required integration is almost certainly
analytically intractable and computationally insurmountable
275
61. Estimation via g-computation
Monte Carlo integration: Robins (1986) proposed approximating
the distribution of Y*(d) for d ∈ D; for r = 1, . . . , M, simulate a
realization from the distribution of Y*(d) as follows
1. Generate random x1r from p_{X1}(x1; ζ̂1)
2. Generate random x2r from p_{X2|X1,A1}{x2 | x1r , d1(x1r ); ζ̂2}
3. Continue in this fashion, generating random xkr from
p_{Xk|X̄k−1,Āk−1}{xk | x̄k−1,r , d̄k−1(x̄k−1,r ); ζ̂k }, k = 3, . . . , K
4. Generate random yr from p_{Y|X̄,Ā}{y | x̄r , d̄K (x̄r ); ζ̂K+1}
y1, . . . , yM are a sample from the fitted distribution of Y*(d)
Estimator for V(d) = E{Y*(d)}: V̂GC(d) = M⁻¹ Σ_{r=1}^{M} yr
276
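Steps 1-4 can be sketched as follows, with invented distributions standing in for the fitted models (the model and its parameter values are assumptions for illustration only):

```python
# A sketch of Monte Carlo g-computation for K = 2 binary stages: follow
# steps 1-4 with "fitted" distributions that are simply invented here.
import random

random.seed(1)
M = 100_000

def draw_x1():               return random.random() < 0.5            # step 1
def draw_x2(x1, a1):         return random.random() < 0.3 + 0.4 * a1 # step 2
def draw_y(x1, x2, a1, a2):  # step 4: outcome given the full history
    return 1.0 + x1 + 2.0 * x2 + a1 + 3.0 * a2 * x2 + random.gauss(0, 1)

d1 = lambda x1: 1                   # always treat at Decision 1
d2 = lambda x1, x2, a1: int(x2)     # treat at Decision 2 iff x2 = 1

total = 0.0
for _ in range(M):
    x1 = draw_x1()
    a1 = d1(x1)                     # option selected by d at Decision 1
    x2 = draw_x2(x1, a1)            # step 3 is vacuous when K = 2
    a2 = d2(x1, x2, a1)             # option selected by d at Decision 2
    total += draw_y(x1, x2, a1, a2)

v_gc = total / M                    # Monte Carlo estimate of V(d)
print(round(v_gc, 2))               # close to 6.0, the exact value here
```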
62. Estimation via g-computation
Practical challenges:
• Development of models can be daunting due to high
dimension/complexity of x1, . . . , xK
• Although specifying p_{Y|X̄,Ā}(y | x̄, ā; ζK+1) may be feasible for
univariate Y, models
p_{Xk|X̄k−1,Āk−1}(xk | x̄k−1, āk−1; ζk ), k = 2, . . . , K, p_{X1}(x1; ζ1)
for multivariate Xk are more challenging to specify
• E.g., Xk = (Xk1ᵀ, Xk2ᵀ)ᵀ continuous/discrete, can factor as
p_{Xk1|Xk2,X̄k−1,Āk−1}(xk1 | xk2, x̄k−1, āk−1; ζk1)
× p_{Xk2|X̄k−1,Āk−1}(xk2 | x̄k−1, āk−1; ζk2)
or p_{Xk2|Xk1,X̄k−1,Āk−1}(xk2 | xk1, x̄k−1, āk−1; ζk2)
× p_{Xk1|X̄k−1,Āk−1}(xk1 | x̄k−1, āk−1; ζk1)
277
63. Estimation via g-computation
Practical challenges, continued:
• Moreover, simulation from such models can be demanding
• Analytical derivation of approximate standard errors is not
straightforward (frankly daunting!); use of a nonparametric
bootstrap has been advocated, which is clearly highly
computationally intensive
Bottom line: Estimation of V(d) via the g-computation algorithm is
not commonplace in practice
• The main usefulness of g-computation is as a demonstration that
it is possible to identify and estimate V(d) from observed data
under SUTVA, SRA, and positivity
• In principle, a possible approach to estimating dopt ∈ D is to
maximize V̂GC(d) over all d ∈ D; clearly, this would be a
formidable computational challenge (and is never done in
practice)
278
64. Inverse probability weighted estimator
Motivation: Alternative representation of E{Y*(d)} in terms of the
observed data
(X1, A1, X2, A2, . . . , XK , AK , Y)
depending on
p_{A1|H1}(a1 | h1) = P(A1 = a1 | H1 = h1) = p_{A1|X1}(a1 | x1)
p_{Ak|Hk}(ak | hk ) = P(Ak = ak | Hk = hk ) = p_{Ak|X̄k,Āk−1}(ak | x̄k , āk−1)
k = 2, . . . , K
Define: Evaluating at d1(X1), dk {X̄k , d̄k−1(X̄k−1)}
πd,1(X1) = p_{A1|X1}{d1(X1) | X1}
πd,k (X̄k ) = p_{Ak|X̄k,Āk−1}[dk {X̄k , d̄k−1(X̄k−1)} | X̄k , d̄k−1(X̄k−1)]
k = 2, . . . , K
279
65. Inverse probability weighted estimator
Define: Indicator of consistency of options actually received with
those selected by d at all K decisions
Cd = Cd̄K = I{A1 = d1(X1), . . . , ĀK = d̄K (X̄K )} = I{Ā = d̄(X̄)}
Inverse probability weighted estimator for V(d) = E{Y*(d)}:
V̂IPW (d) = n⁻¹ Σ_{i=1}^{n} Cd,i Yi / { Π_{k=2}^{K} πd,k (X̄ki) πd,1(X1i) } (5.27)
• (5.27) is an unbiased estimator for V(d) because
E[ Cd Y / { Π_{k=2}^{K} πd,k (X̄k ) πd,1(X1) } ] = E{Y*(d)} (5.28)
under SUTVA (5.10), SRA (5.11), and positivity assumption (5.15)
280
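As a numerical sanity check, here is a minimal sketch of (5.27) for K = 2 with binary options randomized with known probability 1/2 at each decision, as in a SMART. The data-generating scheme is simulated and purely illustrative, not from the source.

```python
import numpy as np

# Hypothetical sketch of the IPW estimator (5.27) for K = 2 with binary
# options randomized with known probability 1/2 at each decision, as in
# a SMART; the data-generating model is simulated for illustration only.
rng = np.random.default_rng(0)
n = 5000

x1 = rng.normal(size=n)                  # baseline information X1
a1 = rng.binomial(1, 0.5, size=n)        # A1 randomized, P(A1 = 1) = 1/2
x2 = x1 + a1 + rng.normal(size=n)        # intermediate information X2
a2 = rng.binomial(1, 0.5, size=n)        # A2 randomized, P(A2 = 1) = 1/2
y = x2 + a2 + rng.normal(size=n)         # final outcome Y

# Regime d: select option 1 at both decisions
c_d = (a1 == 1) & (a2 == 1)              # consistency indicator C_d
pi_d1 = 0.5                              # pi_{d,1}(X1), known
pi_d2 = 0.5                              # pi_{d,2}(Xbar2), known

v_ipw = np.mean(c_d * y / (pi_d2 * pi_d1))   # estimator (5.27)
```

Only about a quarter of subjects have C_d = 1; their outcomes are weighted up by 1/(π_{d,1} π_{d,2}) = 4, which foreshadows the instability of IPW for larger K.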
66. Inverse probability weighted estimator
Sketch of proof of (5.28): We show the more general result
$$E\left\{\frac{C_d\, f(\bar{X}, Y)}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)}\right\} = E[f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}] \qquad (5.29)$$
so that (5.28) follows by taking f(x̄, y) = y
• Using SUTVA and (5.7)
$$E\left\{\frac{C_d\, f(\bar{X}, Y)}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)}\right\} = E\left[\frac{C_d\, f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}}{\prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}\, \pi_{d,1}(X_1)}\right]$$
$$= E\left(E\left[\frac{I\{\bar{A} = \bar{d}(\bar{X})\}\, f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}}{\prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}\, \pi_{d,1}(X_1)}\,\Bigg|\, X_1, W^*\right]\right)$$
$$= E\left[\frac{P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1, W^*\}\, f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}}{\prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}\, \pi_{d,1}(X_1)}\right] \qquad (5.30)$$
281
67. Inverse probability weighted estimator
From (5.30): Must show
$$\prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}\, \pi_{d,1}(X_1) = P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1, W^*\} > 0 \qquad (5.31)$$
• For (x1, w) such that $P(X_1 = x_1, W^* = w) > 0$, show
$$P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1 = x_1, W^* = w\} > 0$$
• There exist x2, . . . , xK such that $P\{\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\} > 0$, k = 2, . . . , K,
because $\bar{X}^*_K(\bar{d}_{K-1})$ is a function of W*
• Using SUTVA
$$P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1 = x_1, W^* = w\} = P\{A_1 = d_1(x_1)\,|\,X_1 = x_1, W^* = w\}$$
$$\times \prod_{k=2}^K P[\,A_k = d_k\{\bar{x}_k, \bar{d}_{k-1}(\bar{x}_{k-1})\}\,|\,\bar{A}_{k-1} = \bar{d}_{k-1}(\bar{x}_{k-1}), X_1 = x_1, W^* = w\,]$$
282
68. Inverse probability weighted estimator
• Must show each term in this factorization is well defined; true if
$$P\{\bar{A}_k = \bar{d}_k(\bar{x}_k), X_1 = x_1, W^* = w\} > 0, \quad k = 1, \ldots, K$$
• By induction; when k = 1,
$$P\{A_1 = d_1(x_1), X_1 = x_1, W^* = w\} = P\{A_1 = d_1(x_1)\,|\,X_1 = x_1, W^* = w\}\, P(X_1 = x_1, W^* = w)$$
is positive if $P\{A_1 = d_1(x_1)\,|\,X_1 = x_1, W^* = w\} > 0$
• This holds because $P(X_1 = x_1) > 0$, so $x_1 \in \Gamma_1$ and $d_1(x_1) \in \Psi_1(x_1)$, and thus
by the positivity assumption
$$P\{A_1 = d_1(x_1)\,|\,X_1 = x_1, W^* = w\} = P\{A_1 = d_1(x_1)\,|\,X_1 = x_1\} > 0$$
• Now assume $P\{\bar{A}_k = \bar{d}_k(\bar{x}_k), X_1 = x_1, W^* = w\} > 0$ and show
$$P\{\bar{A}_{k+1} = \bar{d}_{k+1}(\bar{x}_{k+1}), X_1 = x_1, W^* = w\} > 0$$
283
69. Inverse probability weighted estimator
• $\bar{X}^*_k(\bar{d}_{k-1})$ includes $X_1$, so $(X_1 = x_1, W^* = w)$ and
$\{\bar{X}^*_{k+1}(\bar{d}_k) = \bar{x}_{k+1}, W^* = w\}$ are equivalent, and thus
$$P\{\bar{A}_{k+1} = \bar{d}_{k+1}(\bar{x}_{k+1}), X_1 = x_1, W^* = w\}$$
$$= P[\,A_{k+1} = d_{k+1}\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\}\,|\,\bar{A}_k = \bar{d}_k(\bar{x}_k), \bar{X}^*_{k+1}(\bar{d}_k) = \bar{x}_{k+1}, W^* = w\,]$$
$$\times P\{\bar{A}_k = \bar{d}_k(\bar{x}_k), X_1 = x_1, W^* = w\}$$
• Must show the first term on the right-hand side is > 0; by SUTVA and SRA, this term
$$= P[\,A_{k+1} = d_{k+1}\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\}\,|\,\bar{X}_{k+1} = \bar{x}_{k+1}, \bar{A}_k = \bar{d}_k(\bar{x}_k), W^* = w\,]$$
$$= P[\,A_{k+1} = d_{k+1}\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\}\,|\,\bar{X}_{k+1} = \bar{x}_{k+1}, \bar{A}_k = \bar{d}_k(\bar{x}_k)\,]$$
• Because $\{x_1, d_1(x_1)\} \in \Lambda_1$ and $P\{\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\} > 0$, the argument
leading to (5.23) yields $\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\} \in \Gamma_{k+1}$ and
$d_{k+1}\{\bar{x}_{k+1}, \bar{d}_k(\bar{x}_k)\} \in \Psi_{k+1}(h_{k+1})$, so this term is > 0
284
70. Inverse probability weighted estimator
• Applying these results yields
$$P\{\bar{A} = \bar{d}(\bar{X})\,|\,X_1, W^*\} = p_{A_1|X_1}\{d_1(X_1)\,|\,X_1\} \times \prod_{k=2}^K p_{A_k|\bar{X}_k,\bar{A}_{k-1}}[\,d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}\,|\,\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\,]$$
$$= \pi_{d,1}(X_1) \prod_{k=2}^K \pi_{d,k}\{\bar{X}^*_k(\bar{d}_{k-1})\}$$
which is the desired equality (5.31)
285
71. Inverse probability weighted estimator
Result: (5.28) holds:
$$E\left\{\frac{C_d\, Y}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)}\right\} = E\{Y^*(d)\}$$
• $\hat{V}_{IPW}(d)$ is an unbiased estimator for V(d)
• An alternative representation of E{Y*(d)} in terms of the observed data
• Taking instead, for fixed y,
$$f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\} = I\{Y^*(d) = y\}$$
yields an alternative representation of the marginal density of Y*(d)
286
72. Inverse probability weighted estimator
More generally: For fixed (x1, x2, . . . , xK, y), treating all variables as
discrete, taking
$$f\{\bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\} = I\{X_1 = x_1, X^*_2(d_1) = x_2, \ldots, X^*_K(\bar{d}_{K-1}) = x_K, Y^*(d) = y\}$$
yields an alternative representation of the joint density
$$P\{X_1 = x_1, X^*_2(d_1) = x_2, \ldots, X^*_K(\bar{d}_{K-1}) = x_K, Y^*(d) = y\}$$
$$= p_{X_1, X^*_2(d_1), X^*_3(\bar{d}_2), \ldots, X^*_K(\bar{d}_{K-1}), Y^*(d)}(x_1, \ldots, x_K, y)$$
$$= E\left\{\frac{C_d\, I(X_1 = x_1, X_2 = x_2, \ldots, X_K = x_K, Y = y)}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)}\right\}$$
287
73. Inverse probability weighted estimator
Denominator:
$$\prod_{k=2}^K \pi_{d,k}(\bar{X}_k)\, \pi_{d,1}(X_1)$$
• Can be interpreted as the propensity for receiving treatment
consistent with regime d through all K decisions, given the observed
history
• Depends on the propensities of treatment given observed history
$$p_{A_k|H_k}(a_k|h_k) = P(A_k = a_k\,|\,H_k = h_k), \quad k = 1, \ldots, K$$
• In practice: Posit and fit models for the propensities depending
on parameters γk, yielding estimators $\hat{\gamma}_k$, k = 1, . . . , K;
considerations for this momentarily
• These models induce models $\pi_{d,1}(X_1; \gamma_1)$ and $\pi_{d,k}(\bar{X}_k; \gamma_k)$
288
74. Inverse probability weighted estimator
In practice: IPW estimator
$$\hat{V}_{IPW}(d) = n^{-1}\sum_{i=1}^n \frac{C_{d,i}\, Y_i}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)} \qquad (5.32)$$
• A consistent estimator for V(d) as long as the propensity models
are correctly specified
289
75. Inverse probability weighted estimator
Alternative estimator:
$$\hat{V}_{IPW*}(d) = \left\{\sum_{i=1}^n \frac{C_{d,i}}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)}\right\}^{-1} \times \sum_{i=1}^n \frac{C_{d,i}\, Y_i}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)} \qquad (5.33)$$
• A weighted average, also a consistent estimator
• Can be considerably more precise than $\hat{V}_{IPW}(d)$
290
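A minimal sketch contrasting the two estimators may help; here the true propensities depend on history, as in observational data, and everything (models, coefficients, names) is an invented illustration. The normalized version simply divides by the sum of the weights rather than by n.

```python
import numpy as np

# Hypothetical sketch contrasting the simple IPW estimator (5.32) with
# the weighted-average version (5.33), with history-dependent true
# propensities (an observational-data setting); everything is simulated.
rng = np.random.default_rng(1)
n = 20000

x1 = rng.normal(size=n)
p1 = 1 / (1 + np.exp(-x1))               # true P(A1 = 1 | X1)
a1 = rng.binomial(1, p1)
x2 = x1 + a1 + rng.normal(size=n)
p2 = 1 / (1 + np.exp(-0.5 * x2))         # true P(A2 = 1 | Xbar2, A1)
a2 = rng.binomial(1, p2)
y = x2 + a2 + rng.normal(size=n)

# Regime d: option 1 at both decisions; weight = C_d / {pi_{d,2} pi_{d,1}}
c_d = (a1 == 1) & (a2 == 1)
w = c_d / (p1 * p2)

v_ipw = np.mean(w * y)                   # simple IPW, as in (5.32)
v_ipw_star = np.sum(w * y) / np.sum(w)   # normalized weights, as in (5.33)
```

Because the weights in (5.33) are normalized to sum to one, subjects with very small propensities of following d cannot dominate the estimate to the same degree, which is one intuition for its better finite-sample precision.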
76. Considerations for propensity modeling
Simplest case: Two options at each decision point, feasible for all
individuals, Ak = {0, 1}, k = 1, . . . , K
• Can work with propensity scores
$$\pi_k(h_k) = P(A_k = 1\,|\,H_k = h_k) = \pi_k(\bar{x}_k, \bar{a}_{k-1}), \quad k = 1, \ldots, K$$
$$p_{A_1|X_1}(a_1|x_1) = p_{A_1|H_1}(a_1|h_1) = \pi_1(h_1)^{a_1}\{1 - \pi_1(h_1)\}^{1-a_1}$$
$$p_{A_k|\bar{X}_k,\bar{A}_{k-1}}(a_k|\bar{x}_k, \bar{a}_{k-1}) = p_{A_k|H_k}(a_k|h_k) = \pi_k(h_k)^{a_k}\{1 - \pi_k(h_k)\}^{1-a_k}, \quad k = 2, \ldots, K$$
• Using (5.2)
$$\pi_{d,1}(X_1) = \pi_1(X_1)^{d_1(X_1)}\{1 - \pi_1(X_1)\}^{1-d_1(X_1)}$$
$$\pi_{d,k}(\bar{X}_k) = \pi_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}^{d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}} \times [1 - \pi_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}]^{1-d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}}, \quad k = 2, \ldots, K$$
291
77. Considerations for propensity modeling
• Can posit a parametric model $\pi_k(h_k; \gamma_k)$ for each k; e.g., logistic
regression models as in (3.12)
$$\pi_k(h_k; \gamma_k) = \frac{\exp(\gamma_{k1} + \gamma_{k2}^T h_k)}{1 + \exp(\gamma_{k1} + \gamma_{k2}^T h_k)}, \quad \gamma_k = (\gamma_{k1}, \gamma_{k2}^T)^T, \quad k = 1, \ldots, K$$
equivalently, $\exp(\gamma_k^T \tilde{h}_k)/\{1 + \exp(\gamma_k^T \tilde{h}_k)\}$ with $\tilde{h}_k = (1, h_k^T)^T$,
k = 1, . . . , K; fit via maximum likelihood to obtain $\hat{\gamma}_k$, k = 1, . . . , K
292
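The maximum likelihood fit for one such decision can be sketched in a few lines; the Newton-Raphson helper, data, and true coefficients below are all hypothetical illustrations, not part of the source material.

```python
import numpy as np

# Hypothetical sketch of fitting a logistic propensity model by maximum
# likelihood via Newton-Raphson, as one would do for each decision k; the
# data, true coefficients, and helper name are illustrative.
def fit_logistic(H, a, n_iter=25):
    """ML fit of P(A = 1 | H) = expit(H @ gamma); H includes the constant."""
    gamma = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-H @ gamma))
        W = p * (1 - p)
        # Newton step: gamma += (H^T W H)^{-1} H^T (a - p)
        gamma += np.linalg.solve((H * W[:, None]).T @ H, H.T @ (a - p))
    return gamma

rng = np.random.default_rng(2)
n = 20000
x1 = rng.normal(size=n)                                  # history h1 = x1
true_gamma = np.array([0.3, 0.8])                        # (gamma_11, gamma_12)
a1 = rng.binomial(1, 1 / (1 + np.exp(-(true_gamma[0] + true_gamma[1] * x1))))

H1 = np.column_stack([np.ones(n), x1])                   # design (1, h1^T)^T
gamma1_hat = fit_logistic(H1, a1)                        # estimate of gamma_1
```

In practice one would repeat this for each k = 1, . . . , K, with the design matrix built from the elements of the accrued history hk.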
78. Considerations for propensity modeling
More than 2 options: Feasible for all individuals, Ak = {1, . . . , mk}
• As on Slide 170, with
$$\omega_k(h_k, a_k) = P(A_k = a_k\,|\,H_k = h_k), \quad k = 1, \ldots, K$$
or $\omega_k(\bar{x}_k, \bar{a}_{k-1}, a_k) = P(A_k = a_k\,|\,\bar{X}_k = \bar{x}_k, \bar{A}_{k-1} = \bar{a}_{k-1})$, where
$$\omega_k(h_k, m_k) = 1 - \sum_{a_k=1}^{m_k-1} \omega_k(h_k, a_k)$$
• Then
$$\pi_{d,1}(X_1) = \prod_{a_1=1}^{m_1} \omega_1(X_1, a_1)^{I\{d_1(X_1) = a_1\}} = \sum_{a_1=1}^{m_1} I\{d_1(X_1) = a_1\}\, \omega_1(X_1, a_1)$$
$$\pi_{d,k}(\bar{X}_k) = \prod_{a_k=1}^{m_k} \omega_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1}), a_k\}^{I[d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\} = a_k]} = \sum_{a_k=1}^{m_k} I[d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\} = a_k]\, \omega_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1}), a_k\}, \quad k = 2, \ldots, K$$
293
79. Considerations for propensity modeling
• Can posit parametric models $\omega_k(h_k, a_k; \gamma_k)$ for each k; e.g.,
multinomial (polytomous) logistic regression models
$$\omega_k(h_k, a_k; \gamma_k) = \frac{\exp(\tilde{h}_k^T \gamma_{k,a_k})}{1 + \sum_{j=1}^{m_k-1} \exp(\tilde{h}_k^T \gamma_{kj})}, \quad a_k = 1, \ldots, m_k - 1$$
with $\tilde{h}_k = (1, h_k^T)^T$ and $\gamma_k = (\gamma_{k1}^T, \ldots, \gamma_{k,m_k-1}^T)^T$, and fit via maximum
likelihood to obtain $\hat{\gamma}_k$, k = 1, . . . , K
294
80. Considerations for propensity modeling
Feasible sets: $\ell_k$ distinct subsets at Decision k, each with ≥ 2 options in Ak
$$\mathcal{A}_{k,l} = \{1, \ldots, m_{kl}\}, \quad l = 1, \ldots, \ell_k, \quad k = 1, \ldots, K$$
• For k = 1, . . . , K and $a_k \in \{1, \ldots, m_{kl}\} = \mathcal{A}_{k,l}$,
$$\omega_{k,l}(h_k, a_k) = P(A_k = a_k\,|\,H_k = h_k), \qquad \omega_{k,l}(h_k, m_{kl}) = 1 - \sum_{a_k=1}^{m_{kl}-1} \omega_{k,l}(h_k, a_k)$$
• For k = 1, . . . , K, can posit $\ell_k$ separate logistic or multinomial
(polytomous) logistic regression models $\omega_{k,l}(h_k, a_k; \gamma_{kl})$, l = 1, . . . , $\ell_k$
295
81. Considerations for propensity modeling
• For each k = 1, . . . , K, this implies an overall model
$$\omega_k(h_k, a_k; \gamma_k) = \sum_{l=1}^{\ell_k} I\{s_k(h_k) = l\}\, \omega_{k,l}(h_k, a_k; \gamma_{kl}), \quad \gamma_k = (\gamma_{k1}^T, \ldots, \gamma_{k\ell_k}^T)^T$$
where it is understood that $a_k$ takes values in the relevant distinct subset
• For subsets with a single option, $P(A_k = a_k\,|\,H_k = h_k) = 1$, and
no model is needed
• Induced models
$$\pi_{d,1}(X_1; \gamma_1) = \sum_{l=1}^{\ell_1} I\{s_1(h_1) = l\} \prod_{a_1=1}^{m_{1l}} \omega_{1,l}(X_1, a_1; \gamma_{1l})^{I\{d_1(X_1) = a_1\}}$$
$$\pi_{d,k}(\bar{X}_k; \gamma_k) = \sum_{l=1}^{\ell_k} I\{s_k(h_k) = l\} \times \prod_{a_k=1}^{m_{kl}} \omega_{k,l}\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1}), a_k; \gamma_{kl}\}^{I[d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\} = a_k]}$$
296
82. Considerations for propensity modeling
In a SMART: Propensities are known
• Even so, as on Slide 90, it is preferable to estimate the propensities
• Estimate $P(A_k = a_k\,|\,H_k = h_k)$ by sample proportions corresponding to
each distinct subset $\mathcal{A}_{k,l}$
• Example: Acute leukemia, $\ell_2 = 2$, with $r_2$ the component of $h_2$ indicating
response
$$\Psi_2(h_2) = \{M_1, M_2\} = \mathcal{A}_{2,1} \text{ if } r_2 = 1, \qquad \Psi_2(h_2) = \{S_1, S_2\} = \mathcal{A}_{2,2} \text{ if } r_2 = 0$$
• Estimate $P(A_2 = a_2\,|\,H_2 = h_2)$ for $a_2 \in \mathcal{A}_{2,1}$ by
$$\left\{\sum_{i=1}^n I(R_{2i} = 1)\right\}^{-1} \sum_{i=1}^n I(R_{2i} = 1)\, I(A_{2i} = a_2)$$
• Estimate $P(A_2 = a_2\,|\,H_2 = h_2)$ for $a_2 \in \mathcal{A}_{2,2}$ by
$$\left\{\sum_{i=1}^n I(R_{2i} = 0)\right\}^{-1} \sum_{i=1}^n I(R_{2i} = 0)\, I(A_{2i} = a_2)$$
297
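These sample-proportion estimates are one-liners in code. The sketch below simulates a SMART-like second stage with responders randomized 1:1 between two maintenance options (coded 0/1); the data and coding are illustrative.

```python
import numpy as np

# Hypothetical sketch of sample-proportion propensity estimates in an
# acute-leukemia-style SMART: among responders (R2 = 1), estimate
# P(A2 = a2 | H2) by the observed proportion receiving each maintenance
# option (M1/M2 coded 0/1 here); simulated data, illustrative coding.
rng = np.random.default_rng(3)
n = 1000
r2 = rng.binomial(1, 0.6, size=n)        # responder status component of h2
a2 = rng.binomial(1, 0.5, size=n)        # second-stage option, randomized 1:1

resp = r2 == 1
p_m1 = np.sum(resp & (a2 == 0)) / np.sum(resp)   # estimate of P(A2 = M1 | responder)
p_m2 = np.sum(resp & (a2 == 1)) / np.sum(resp)   # estimate of P(A2 = M2 | responder)
```

The nonresponder propensities for {S1, S2} are computed the same way over the R2 = 0 subset; by construction the estimated probabilities within each feasible subset sum to one.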
83. Inverse probability weighted estimator
Equivalent representation: For both $\hat{V}_{IPW}(d)$ and $\hat{V}_{IPW*}(d)$
• When $C_d = 1$, $A_1 = d_1(X_1)$ and $A_k = d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}$,
k = 2, . . . , K, so it is straightforward that the denominator
$$\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)$$
can be replaced by the fitted model for
$$\prod_{k=1}^K p_{A_k|H_k}(A_{ki}|H_{ki})$$
without altering the value of either estimator
298
84. Inverse probability weighted estimator
• E.g., with 2 feasible options at each decision point, replace the
denominator by
$$\prod_{k=1}^K \pi_k(H_{ki}; \hat{\gamma}_k)^{A_{ki}}\{1 - \pi_k(H_{ki}; \hat{\gamma}_k)\}^{1-A_{ki}}$$
• With more than 2 options, by
$$\prod_{k=1}^K \omega_k(H_{ki}, A_{ki}; \hat{\gamma}_k)$$
• In some literature accounts, the estimators are defined with
these quantities in the denominators
• This will be important later
299
85. Inverse probability weighted estimator
Remarks:
• VIPW (d) and VIPW∗(d) are consistent estimators as long as the
propensity models are correctly specified; can be inconsistent
otherwise
• Except for estimation of the propensities, these estimators use data
only from subjects for whom $\bar{A} = \bar{d}(\bar{X})$
• For larger K, there are likely very few such subjects
• These estimators can be unstable in finite samples, because the
propensity of receiving treatment consistent with d in the denominator
can be small, and thus can exhibit large sampling variation
• As in the single decision case, even if the propensities are
known, it is preferable to estimate them via maximum likelihood
• Large sample approximations to the sampling distributions of
VIPW (d) and VIPW∗(d) follow from M-estimation theory
300
86. Augmented inverse probability weighted estimator
Drop out analogy: For individuals for whom $\bar{A} \neq \bar{d}(\bar{X})$
• Analogy to a monotone coarsening problem; i.e., "drop out"
• Under SUTVA, an individual with
$$A_1 = d_1(X_1), \ldots, A_{k-1} = d_{k-1}\{\bar{X}_{k-1}, \bar{d}_{k-2}(\bar{X}_{k-2})\}$$
but for whom
$$A_k \neq d_k\{\bar{X}_k, \bar{d}_{k-1}(\bar{X}_{k-1})\}$$
has $X_2 = X^*_2(d_1), \ldots, X_k = X^*_k(\bar{d}_{k-1})$; i.e., $\bar{X}_k = \bar{X}^*_k(\bar{d}_{k-1})$
• But his Xk+1, . . . , XK and Y do not reflect the potential outcomes
he would have if he had continued to follow rules dk , . . . , dK
• Effectively, in terms of receiving treatment options consistent
with d, the individual has “dropped out” at Decision k
301
87. Augmented inverse probability weighted estimator
Improvement: From the perspective of information on
$\{X_1, \bar{X}^*_K(\bar{d}_{K-1}), Y^*(d)\}$, and especially on Y*(d)
• $\bar{X}^*_k(\bar{d}_{k-1})$ is observed, but $X^*_{k+1}(\bar{d}_k), \ldots, X^*_K(\bar{d}_{K-1})$ and Y*(d) are
"missing"
• Can we exploit the partial information from such individuals to
gain efficiency?
• Can we exploit the partial information from such individuals to
gain efficiency?
Dropout analogy: Under SUTVA, SRA, and positivity, this “drop out”
is according to a missing (coarsening) at random mechanism (shown
by Zhang et al., 2013), allowing semiparametric theory for monotone
coarsening at random to be used (Robins et al., 1994; Tsiatis, 2006)
302
88. Augmented inverse probability weighted estimator
Define:
• Indicator that treatment options are consistent with d through Decision k:
$$C_{\bar{d}_k} = I\{\bar{A}_k = \bar{d}_k(\bar{X}_k)\}, \quad k = 1, \ldots, K, \qquad C_{\bar{d}_0} \equiv 1$$
• For brevity, write $\bar{\pi}_{d,1}(X_1) = \pi_{d,1}(X_1)$ and
$$\bar{\pi}_{d,k}(\bar{X}_k) = \left\{\prod_{j=2}^k \pi_{d,j}(\bar{X}_j)\right\} \pi_{d,1}(X_1), \quad k = 2, \ldots, K, \qquad \bar{\pi}_{d,0} \equiv 1$$
• Substitution of the propensity models leads to models
$$\bar{\pi}_{d,k}(\bar{X}_k; \bar{\gamma}_k), \quad k = 1, \ldots, K, \qquad \bar{\gamma}_k = (\gamma_1^T, \ldots, \gamma_k^T)^T$$
303
89. Augmented inverse probability weighted estimator
Analogous to the AIPW estimator for a single decision: Under these
conditions, if the models for the propensities
$$p_{A_k|H_k}(a_k|h_k), \quad k = 1, \ldots, K$$
are correctly specified, from semiparametric theory, all consistent and
asymptotically normal estimators for V(d) for fixed d ∈ D are
asymptotically equivalent to an estimator of the form
$$\hat{V}_{AIPW}(d) = n^{-1}\sum_{i=1}^n\left[\frac{C_{d,i}\, Y_i}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)} + \sum_{k=1}^K \left\{\frac{C_{\bar{d}_{k-1},i}}{\bar{\pi}_{d,k-1}(\bar{X}_{k-1,i}; \hat{\bar{\gamma}}_{k-1})} - \frac{C_{\bar{d}_k,i}}{\bar{\pi}_{d,k}(\bar{X}_{ki}; \hat{\bar{\gamma}}_k)}\right\} L_k(\bar{X}_{ki})\right] \qquad (5.34)$$
• $L_k(\bar{x}_k)$ are arbitrary functions of $\bar{x}_k$, k = 1, . . . , K
• $\bar{\gamma}_k = (\gamma_1^T, \ldots, \gamma_k^T)^T$, k = 1, . . . , K
304
90. Augmented inverse probability weighted estimator
Features of $\hat{V}_{AIPW}(d)$:
• Taking $L_k(\bar{x}_k) \equiv 0$, k = 1, . . . , K, yields $\hat{V}_{IPW}(d)$ in (5.32)
• When K = 1, because $C_{\bar{d}_0} \equiv 1$, $\bar{\pi}_{d,0} \equiv 1$, and $X_1 = H_1$, (5.34)
reduces to the AIPW estimator (3.16) for the single decision case
• If all K propensity models are correctly specified, so that there are
true values $\gamma_{k,0}$ of $\gamma_k$, k = 1, . . . , K, the "augmentation term" in
(5.34) evaluated at $\gamma_{k,0}$, k = 1, . . . , K, converges in probability to
zero for arbitrary $L_k(\bar{x}_k)$, k = 1, . . . , K
• Thus, (5.34) is a consistent estimator for V(d), with asymptotic
variance depending on the choice of $L_k(\bar{x}_k)$
305
91. Augmented inverse probability weighted estimator
Efficient estimator: From semiparametric theory, the estimator with
smallest asymptotic variance among those in the class (5.34) takes
$$L_k(\bar{x}_k) = E\{Y^*(d)\,|\,\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\}, \quad k = 1, \ldots, K \qquad (5.35)$$
• The conditional expectations in (5.35) are functionals of the
distribution of the potential outcomes
$\{X_1, X^*_2(d_1), \ldots, X^*_K(\bar{d}_{K-1}), Y^*(d)\}$ and are unknown in practice
• As in the single decision case, posit and fit models
$$Q_{d,k}(\bar{x}_k; \beta_k), \quad k = 1, \ldots, K, \qquad (5.36)$$
for $E\{Y^*(d)\,|\,\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\}$, k = 1, . . . , K, and substitute in
(5.34)
• We present approaches to developing and fitting models (5.36)
when we discuss optimal regimes later
306
92. Augmented inverse probability weighted estimator
Result: Given estimators $\hat{\beta}_k$, k = 1, . . . , K, obtained as we discuss
later, the AIPW estimator is
$$\hat{V}_{AIPW}(d) = n^{-1}\sum_{i=1}^n\left[\frac{C_{d,i}\, Y_i}{\prod_{k=2}^K \pi_{d,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d,1}(X_{1i}; \hat{\gamma}_1)} + \sum_{k=1}^K \left\{\frac{C_{\bar{d}_{k-1},i}}{\bar{\pi}_{d,k-1}(\bar{X}_{k-1,i}; \hat{\bar{\gamma}}_{k-1})} - \frac{C_{\bar{d}_k,i}}{\bar{\pi}_{d,k}(\bar{X}_{ki}; \hat{\bar{\gamma}}_k)}\right\} Q_{d,k}(\bar{X}_{ki}; \hat{\beta}_k)\right] \qquad (5.37)$$
• $\hat{V}_{AIPW}(d)$ in (5.37) is doubly robust; i.e., it is consistent if either (1)
the models for the propensities, and thus for $\pi_{d,1}(x_1)$ and
$\pi_{d,k}(\bar{x}_k)$, k = 2, . . . , K, or (2) the models (5.36) for
$E\{Y^*(d)\,|\,\bar{X}^*_k(\bar{d}_{k-1}) = \bar{x}_k\}$ are correctly specified
• If the observed data are from a SMART, the propensities are
known, and $\hat{V}_{AIPW}(d)$ is guaranteed to be consistent regardless
of the models (5.36)
307
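The structure of (5.37) can be sketched numerically. Below, K = 2 with known propensities of 1/2 at each decision (a SMART), and the true conditional expectations under the invented generative scheme stand in for the fitted models Q_{d,k}; all data, models, and names are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of the AIPW estimator (5.37) for K = 2 with known
# propensities of 1/2 (a SMART) and the true conditional expectations
# E{Y*(d) | Xbar_k} standing in for fitted models Q_{d,k}; all simulated.
rng = np.random.default_rng(4)
n = 10000

x1 = rng.normal(size=n)
a1 = rng.binomial(1, 0.5, size=n)
x2 = x1 + a1 + rng.normal(size=n)
a2 = rng.binomial(1, 0.5, size=n)
y = x2 + a2 + rng.normal(size=n)

# Regime d: option 1 at both decisions
c1 = (a1 == 1).astype(float)             # C_{dbar_1}
c2 = c1 * (a2 == 1)                      # C_{dbar_2} = C_d
pibar1, pibar2 = 0.5, 0.25               # cumulative propensities pibar_{d,k}

# True E{Y*(d) | Xbar_k} under this generative scheme, playing Q_{d,k}
q1 = x1 + 2.0                            # Q_{d,1}(x1)
q2 = x2 + 1.0                            # Q_{d,2}(xbar2); used only when c1 = 1

ipw_term = c2 * y / pibar2
aug_term = (1.0 - c1 / pibar1) * q1 + (c1 / pibar1 - c2 / pibar2) * q2
v_aipw = np.mean(ipw_term + aug_term)    # estimator of the form (5.37)
```

Note that the k = 2 augmentation term is nonzero only for individuals with C_{d̄,1} = 1, for whom X̄2 = X̄*2(d1): exactly the "partial dropouts" whose information the augmentation recovers.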
93. Augmented inverse probability weighted estimator
Efficient estimator: If all models are correctly specified, $\hat{V}_{AIPW}(d)$ in
(5.37) is efficient among estimators in the class (5.34), achieving the
smallest asymptotic variance
Remarks:
• $\hat{V}_{AIPW}(d)$ can exhibit considerably less sampling variation than
the simple IPW estimators (5.32) and (5.33)
• As for $\hat{V}_{IPW}(d)$ and $\hat{V}_{IPW*}(d)$, a large sample approximate
sampling distribution can be obtained via M-estimation theory
(although it is pretty involved)
• Zhang et al. (2013) propose (5.37) in a different but equivalent
form following directly from that in Tsiatis (2006)
308
94. Estimation via marginal structural models
Alternative approach: When scientific interest focuses on regimes
with simple rules that can be represented in terms of a
low-dimensional parameter η
• Formally, restrict to a subset $\mathcal{D}_\eta \subset \mathcal{D}$ with elements
$$d_\eta = \{d_1(h_1; \eta_1), \ldots, d_K(h_K; \eta_K)\}, \quad \eta = (\eta_1^T, \ldots, \eta_K^T)^T$$
• Goal: Estimate the value of a fixed regime dη∗ ∈ Dη
corresponding to a particular η∗, V(dη∗ ) = E{Y*(dη∗ )}
• As in the example coming next, Dη may be even simpler, with
η = η1 = · · · = ηK
309
95. Estimation via marginal structural models
Example: Treatment of HIV-infected patients
• K monthly clinic visits; at each, a decision on whether or not to
administer antiretroviral (ARV) therapy for the next month based on
CD4 T-cell count (cells/mm3) (larger is better)
• Two feasible options: 1 = ARV therapy for the next month or 0 =
no ARV therapy for the next month
• Focus on rules involving a common CD4 threshold η below
which ARV therapy is administered and above which it is not
dk (hk ; η) = I(CD4k ≤ η), k = 1, . . . , K
CD4k = CD4 T cell count (cells/mm3) immediately prior to
Decision k
• Dη comprises regimes dη with rules of this form
• Final outcome: Y*(dη) = the negative of viral load (viral RNA
copies/mL) measured 1 month after Decision K if ARV therapy were
administered using the rules in dη, so that larger outcomes are better
310
96. Estimation via marginal structural models
Focus: Estimation of V(dη∗ ) for particular threshold η∗
• Ideal: Data from a study where HIV patients were given ARV
therapy according to dη∗ and viral load was ascertained after
Decision K
• More likely: Available data are observational, with information
Xk, k = 1, . . . , K, including CD4k; the options Ak actually received;
and Y = the negative of final viral load recorded
• In these data, there are likely very few if any individuals who
received ARV therapy according to the rules in dη∗
• An approach to using these data to estimate V(dη∗ ) is suggested
by work of Orellana et al. (2010ab)
311
97. Estimation via marginal structural models
Marginal structural model: A model for V(dη) as a function of η
• I.e., with V(dη) = µ(η) for some function µ( · ), posit a parametric
model
$$V(d_\eta) = E\{Y^*(d_\eta)\} = \mu(\eta; \alpha) \qquad (5.38)$$
referred to as a marginal structural model (MSM)
• For example, a quadratic model
$$\mu(\eta; \alpha) = \alpha_1 + \alpha_2 \eta + \alpha_3 \eta^2, \quad \alpha = (\alpha_1, \alpha_2, \alpha_3)^T$$
• Estimate α based on the data by an appropriate method to obtain
$\hat{\alpha}$, and estimate V(dη∗) by
$$\hat{V}_{MSM}(d_{\eta^*}) = \mu(\eta^*; \hat{\alpha})$$
• η plays the role of "covariate," and $\mu(\eta^*; \hat{\alpha})$ is the "predicted
value" at the particular value η∗ of interest
312
98. Estimation via marginal structural models
Hope: The MSM is a correct specification of the true relationship
µ(η) across a plausible range of thresholds of interest
Estimation of α: Ideally, data from a prospective randomized study
• Each subject i = 1, . . . , n is randomized to one of m
predetermined thresholds η(j), j = 1, . . . , m, in the range of
interest and receives ARV according to dη(j)
• The natural estimator $\hat{\alpha}$ solves in α the estimating equation
$$\sum_{i=1}^n \sum_{j=1}^m C_{d_{\eta^{(j)}},i}\, \frac{\partial \mu(\eta^{(j)}; \alpha)}{\partial \alpha}\, w(\eta^{(j)})\, \{Y_i - \mu(\eta^{(j)}; \alpha)\} = 0 \qquad (5.39)$$
where w(η) is a weight function and $C_{d_\eta} = I\{\bar{A} = \bar{d}_\eta(\bar{X})\}$
• w(η) ≡ 1 yields OLS estimation
• Each subject i has treatment experience consistent with exactly one
of the η(j), j = 1, . . . , m
• (5.39) is an unbiased estimating equation if µ(η; α) is correctly specified
313
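The ideal randomized version with w ≡ 1 is just OLS of Y on a polynomial in the assigned threshold, followed by prediction at η*. The sketch below invents a quadratic true value curve and simulated assignments purely for illustration.

```python
import numpy as np

# Hypothetical sketch of MSM value estimation (5.38)-(5.39) in the ideal
# randomized setting: subjects are randomized across m thresholds
# eta_(j) = 100 + 400(j-1)/(m-1), the quadratic MSM is fit by OLS
# (w = 1), and V(d_{eta*}) is the predicted value at eta*.  The "true"
# value curve generating Y below is invented for illustration.
rng = np.random.default_rng(5)
n, m = 3000, 9
etas = np.linspace(100.0, 500.0, m)      # the thresholds eta_(j)
eta_i = rng.choice(etas, size=n)         # each subject follows one d_eta

# Invented quadratic true value curve plus noise
y = 1.0 + 0.01 * eta_i - 1.5e-5 * eta_i**2 + rng.normal(scale=0.5, size=n)

# OLS fit of mu(eta; alpha) = alpha1 + alpha2*eta + alpha3*eta^2
X = np.column_stack([np.ones(n), eta_i, eta_i**2])
alpha_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

eta_star = 350.0
v_msm = alpha_hat @ np.array([1.0, eta_star, eta_star**2])  # mu(eta*; alpha_hat)
```

With observational data, the same fit would be carried out with each subject's contribution to every threshold it is consistent with, inverse weighted by the estimated propensity of following that d_η, as in the estimating equation on the next slides.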
99. Estimation via marginal structural models
Observational data: The approach is motivated by (5.39)
• Some individuals have treatment experience consistent with ≥ 1
values of η, e.g., with K = 2
Experience Consistent with
CD41=300, A1 = 1, CD42=400, A2 = 0 η ∈ (300, 400)
CD41=300, A1 = 1, CD42=400, A2 = 1 η > 400
CD41=300, A1 = 0, CD42=400, A2 = 0 η < 300
• Others have treatment experience consistent with no value of η
Experience Consistent with
CD41=300, A1 = 0, CD42=400, A2 = 1 no η
314
100. Estimation via marginal structural models
Observational data: The estimator $\hat{\alpha}$ is the solution to
$$\sum_{i=1}^n \int_{\mathcal{D}_\eta} \frac{C_{d_\eta,i}}{\prod_{k=2}^K \pi_{d_\eta,k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{d_\eta,1}(X_{1i}; \hat{\gamma}_1)} \times \frac{\partial \mu(\eta; \alpha)}{\partial \alpha}\, w(\eta)\, \{Y_i - \mu(\eta; \alpha)\}\, d\nu(d_\eta) = 0$$
• dν(dη) is an appropriate dominating measure on dη ∈ Dη
• w(η) is a weight function
• A given individual i can contribute information on multiple
thresholds or none at all
• As in the IPW estimators, each of her contributions is weighted
by the reciprocal of an estimator for the propensity of receiving
treatment consistent with dη given observed history
315
101. Estimation via marginal structural models
For example: With interest in η ∈ [100, 500], for j = 1, . . . , m, partition
so that $\eta^{(j)} = 100 + 400(j-1)/(m-1)$, and let dν(dη) place point mass
at each $\eta^{(j)}$:
$$\sum_{i=1}^n \sum_{j=1}^m \frac{C_{d_{\eta^{(j)}},i}}{\prod_{k=2}^K \pi_{\eta^{(j)},k}(\bar{X}_{ki}; \hat{\gamma}_k)\, \pi_{\eta^{(j)},1}(X_{1i}; \hat{\gamma}_1)} \times \frac{\partial \mu(\eta^{(j)}; \alpha)}{\partial \alpha}\, w(\eta^{(j)})\, \{Y_i - \mu(\eta^{(j)}; \alpha)\} = 0$$
• Under SUTVA, SRA, and positivity, if µ(η; α) and the propensity
models are correctly specified, these estimating equations can
be shown to be unbiased, and α is an M-estimator
• Here, if there is a sufficient number of individuals who received
treatment consistent with at least one η(j), j = 1, . . . , m, $\hat{\alpha}$ should
be a reasonable estimator in practice
• An augmented version of the estimating equation is possible
316
102. 5. Multiple Decision Treatment Regimes: Framework and
Fundamentals
5.1 Multiple Decision Treatment Regimes
5.2 Statistical Framework
5.3 The g-Computation Algorithm
5.4 Estimation of the Value of a Fixed Regime
5.5 Key References
317
103. References
Orellana, L., Rotnitzky, A., and Robins, J. M. (2010a). Dynamic
regime marginal structural mean models for estimation of optimal
dynamic treatment regimes, part I: Main content. The International
Journal of Biostatistics, 6.
Orellana, L., Rotnitzky, A., and Robins, J. M. (2010b). Dynamic
regime marginal structural mean models for estimation of optimal
dynamic treatment regimes, part II: Proofs and additional results. The
International Journal of Biostatistics, 6.
Robins, J. (1986). A new approach to causal inference in mortality
studies with sustained exposure periods - application to control of the
healthy worker survivor effect. Mathematical Modelling, 7,
1393–1512.
Robins, J. (1987). Addendum to: A new approach to causal inference
in mortality studies with sustained exposure periods - application to
control of the healthy worker survivor effect. Computers and
Mathematics with Applications, 14, 923–945.
318
104. References
Robins, J. M. (2004). Optimal structural nested models for optimal
sequential decisions. In Lin, D. Y. and Heagerty, P., editors,
Proceedings of the Second Seattle Symposium on Biostatistics,
pages 189–326, New York. Springer.
Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2013).
Robust estimation of optimal dynamic treatment regimes for
sequential treatment decisions. Biometrika, 100, 681–694.
Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating
individual treatment rules using outcome weighted learning. Journal
of the American Statistical Association, 107, 1106–1118.
319