Causal Inference Opening Workshop - New Statistical Learning Methods for Estimating the Optimal Dynamic Treatment Regime - Lu Wang, December 10, 2019

New Statistical Learning Methods for Estimating the
Optimal Dynamic Treatment Regime
Lu Wang
Associate Professor
University of Michigan, Ann Arbor
luwang@umich.edu
December 10, 2019
Lu Wang ACWL and T-RL for Optimal DTR December 10, 2019 1 / 27

Overview
1 Introduction: Dynamic Treatment Regime (DTR)
2 Adaptive Contrast Weighted Learning (ACWL)
3 Tree-based Reinforcement Learning (T-RL)
4 Summary and Other Ongoing Research

Motivating Example: Cancer Management
Esophageal Cancer Patients, MD Anderson, 1998 to 2012

Motivation of Dynamic Treatment Regimes (DTRs)
Chronic Disease management: routinely adjust, change, add, or
discontinue treatment based on progress, side eﬀects, patient burden,
compliance, etc.
→ Dynamic Treatment Regimes (DTRs)
Goals: To account for patient heterogeneity and to personalize health
care strategy:
One Size Fits All
Degree of Tailoring
−−−−−−−−−−−−−−→
Inherent Characteristics
Targeted/Individualized
Once and for All
Degree of Tailoring
−−−−−−−−−−−−−−−−−−→
Time-Varying Characteristics
Dynamic

Evidence-based Personalized Health Care
To provide meaningful improved health outcomes for patients by
delivering the right drug with the right dose at the right time.
One size ﬁts all Individualized TherapyDegree of Tailoring
Low predictability
Suboptimal health outcomes
Asses patient response heterogeneity;
Stratify patient population;
Optimize beneﬁts/risk, etc.
Better predictability
Improved health outcomes
Family History
Medical
Genetics
Lifestyle
Environment
Engine

Key Notation
Stage j, j = 1, . . . , T with treatment options Kj (≥ 2)
Aj: treatment at stage j with value aj ∈ Aj = {1, . . . , Kj}
Xj: covariate history just prior to assign Aj
Y : overall outcome of interest
DTR g = (g1, . . . , gT ): a set of decision rules with
gj : Domain of Hj = (A1, . . . , Aj−1, Xj ) → Aj.
Y ∗(a): counterfactual outcome under treatment a

Assumptions for Identiﬁability using Observational Data
Use causal framework to assess DTRs under three assumptions
(Murphy et al. 2001; Robins and Hernan, 2009)
1 Consistency (no interference)
If a patient’s observed treatment history is compatible with a
given DTR, his or her observed clinical outcomes are the same as
the counterfactual ones under the DTR.
2 No Unmeasured Confounding
Treatment decision at a given time is independent of future
observations and counterfactual outcomes, conditional on all
previous covariate and treatment history.
3 Positivity
With probability one, every subject follows a speciﬁc DTR with a
nonnegative probability, which is bounded away from 0.

Adaptive Contrast Weighted Learning (ACWL)
Tao and Wang, 2017 Biometrics
For T = 1:
gopt
= arg max
g∈G
EH
K
a=1
E{Y ∗
(a)|H}I{g(H) = a} .
Under the Causal Assumptions 1-3, we have
gopt
= arg max
g∈G
EH
K
a=1
µa(H)I{g(H) = a} ,
where µa(H) E(Y |A = a, H).
Incorporate order statistics µ(1)(H) ≤ · · · ≤ µ(K)(H), then
gopt
= arg max
g∈G
EH
K
a=1
µ(a)(H)I{g(H) = la(H)} ,
where µ(a)(H) = µla (H)

Property of gopt
gopt
= arg min
g∈G
EH
K−1
a=1
{µ(K)(H) − µ(a)(H)}I{g(H) = la(H)} .
Minimizes the expected loss in the outcome due to sub-optimal
treatments in the entire population of interest.
Classify as many patients as possible to their optimal treatment
lK (i.e., letting I{g(H) = la(H)} = 0, a = 1, . . . , K − 1) while
putting more emphasis on patients with larger contrasts (i.e.,
larger µ(K)(H) − µ(a)(H)).

Bounds of the Objective Function
Let C1(H) µ(K)(H) − µ(K−1)(H), and C2(H) µ(K)(H) − µ(1)(H),
then
0 ≤ C1(H) ≤ µ(K)(H) − µ(a)(H) ≤ C2(H)
Therefore,
EH [C1(H)I{g(H) = lK(H)}]
≤ EH
K−1
a=1
{µ(K)(H) − µ(a)(H)}I{g(H) = la(H)}
≤ EH [C2(H)I{g(H) = lK(H)}]

Adaptive Contrasts for Transformation to Weighted
Classiﬁcation
Contrasts C1 and C2: minimum and maximum expected losses in
the outcome given sub-optimal treatments, adaptive to each
patient’s own treatment eﬀect ordering
In the best case where sub-optimal treatments only lead to
minimal expected losses in the outcome, gopt is equal to
arg min
g∈G
EH [C1(H)I{g(H) = lK(H)}] ,
In the worst case where sub-optimal treatments all lead to
maximal expected losses in the outcome, gopt is equal to
arg min
g∈G
EH [C2(H)I{g(H) = lK(H)}] .

Estimate µA(H) to Construct C1(H) and C2(H)
Given estimated propensity score ˆπa(H), the AIPW estimator
ˆµAIPW
a for E{Y ∗(a)} is
ˆµAIPW
a = Pn
I(A = a)
ˆπa(H)
Y + 1 −
I(A = a)
ˆπa(H)
ˆµa(H) .
Lemma (Double Robustness)
ˆµAIPW
a is a consistent estimator of E{Y ∗(a)} if either the propensity
model πa(H) or the conditional mean model µa(H) is correctly speciﬁed.
At a subject level,
ˆµAIPW
a (H) =
I(A = a)
ˆπa(H)
Y + 1 −
I(A = a)
ˆπa(H)
ˆµa(H).

Adaptive Contrast Weighted Learning (ACWL) with T > 1
For stage T, the assumptions and method derivation are the same
as in a single stage scenario.
For stage j, T − 1 ≥ j ≥ 1, we incorporate Q-learning to conduct
backward induction. The stage-speciﬁc pseudo outcome POj for
estimating treatment eﬀect ordering and adaptive contrasts is a
predicted counterfactual outcome under optimal treatments at all
future stages, i.e.,
POj = E Y ∗
(A1, . . . , Aj, gopt
j+1, . . . , gopt
T ) .
Replace Y with POj to apply the same method as in T = 1.

Simulation Results
Q-learning BOWL ACWL
Stage1Stage2
Q-learning BOWL ACWL

Tree-based Reinforcement Learning (T-RL): Motivation
Tao, Wang and Armirall D. (2018) Annals of Applied Statistics
Supervised learning (SL):
True label is known
Then tree-based methods are easy to understand and interpret
without distributional assumptions for data, e.g., Classification
and Regression Tree (CART)
In a DTR problem, label (the optimal treatment) at each stage is not
directly observed
ACWL uses semiparametric regression to estimate the label first
and then convert RL to SL to apply existing classification methods
such as CART

T-RL: Methodological Basis and Feature
Batch-mode RL using a sequence of unsupervised decision trees with
backward induction
Maintain the RL feature without the extra step of conversion to
SL, and thus reduce estimation uncertainty (improvement upon
ACWL)
Incorporate multiple treatment comparisons directly, to improve
eﬃciency (improvement upon ACWL)
Also allow multi-categorical and ordinal treatments
Similar to ACWL, we expect T-RL to be robust and eﬃcient by
combining robust semiparametric regression estimators with
nonparametric machine learning methods.

Example of an Unsupervised Decision Tree
The selected split at each node must improve the expected
counterfactual outcome.
Nodes Ωm, m = 1, 2, . . . , are regions deﬁned by subset covariate space.

Tree-based RL (T-RL): Purity Measures
Use the purity measure to improve the expected counterfactual
outcome and apply doubly robust AIPW estimators as in ACWL.
For a given partition ω and ωc of node Ω, let gj,ω,a1,a2 denote the
decision rule that assigns treatment a1 to subjects in ω and
treatment a2 to subjects in ωc at stage j.
We deﬁne the purity measure Pj(Ω, ω) as
max
a1,a2∈Aj
Pn


Kj
aj=1
ˆµAIPW
j,aj
(Hj)I{gj,ω,a1,a2 (Hj) = aj}I(Hj ∈ Ω)

 .
Pj(Ω, ω) evaluates the performance of the best decision rule which
assigns a single treatment for each of the two arms under partition.

T-RL: Recursive Partitioning
The best split ωopt is chosen to maximize the improvement in the
purity, Pj(Ω, ω) − Pj(Ω).
Stopping Rules, given minimum node size n0 and minimum
purity improvement λ
1 If the node size is less than 2n0, the node will not be split.
2 If all possible splits of a node result in a child node with size smaller
than n0, the node will not be split.
3 If the current tree depth reaches the user-speciﬁed maximum depth,
the tree growing process will stop.
4 If the maximum purity improvement Pj(Ω, ˆωopt
) − Pj(Ω) is less
than λ, the node will not be split, where
ˆωopt
= arg max
ω
[Pj(Ω, ω) : min{nPnI(Hj ∈ ω), nPnI(Hj ∈ ωc
)} ≥ n0]

T-RL Algorithm
1 Start with stage j = T.
2 Obtain AIPW estimates ˆµAIPW
j,aj
(Hj), aj = 1, . . . , Kj,.
3 At root node Ωj,m, m = 1, set values for λ and n0.
4 At node Ωj,m, evaluate the four Stopping Rules. If any of them
is satisﬁed, assign a single best treatment
arg max
aj∈Aj
Pn[ˆµAIPW
j,aj
(Hj)I(Hj ∈ Ωj,m)]
to all subjects in Ωj,m. Otherwise, split Ωj,m into child nodes
Ωj,2m and Ωj,2m+1 by ˆωopt.
5 Set m = m + 1 and repeat Step 4 until all nodes are terminal.
6 If j > 1, set j = j − 1 and repeat steps 2 to 5. If j = 1, stop.

Simulation 1: Tree-type DTR

Simulation 2: Non-tree-type DTR

Summary
ACWL transforms batch-mode RL to SL and incorporates existing
classification methods: flexible and easy to implement.
T-RL directly deals with batch-mode RL and uses unsupervised
trees to learn optimal DTRs with purity measures maximizing the
counterfactual mean outcome.
ACWL and T-RL both combines semiparametric regression with
machine learning and can handle multiple stages and multiple
treatments.
Overall, both methods work well across different scenarios. T-RL
has slightly better robustness. T-RL works better for tree-type
DTRs while ACWL works better for non-tree-type ones.

Other Ongoing Research Projects
Stochastic Tree-based Reinforcement Learning (ST-RL) for
estimating optimal DTRs
Explore diﬀerent structures of optimal DTRs
with restricted arms from observational data
with restricted tailoring variables
Some special DTR frameworks: e.g. Nested Test-and-Treat DTRs
Multivariate utility function: multiple competing clinical priorities
Continuous treatment options, e.g., radiation doses
Mobile health: continuous decisions and data updating

Acknowledgment
My Ph.D. students:
Yebin Tao, Yilun Sun, Nina Zhou, Ming Tang, Kelly Speth,
Jincheng Shen, Mochuan Liu, Yingchao Zhong.
My collaborators:
Daniel Almirall, Jeremy Taylor, Peter Thall, Stewart Wang.
Thank You!
luwang@umich.edu

Causal Inference Opening Workshop - New Statistical Learning Methods for Estimating the Optimal Dynamic Treatment Regime - Lu Wang, December 10, 2019

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Causal Inference Opening Workshop - New Statistical Learning Methods for Estimating the Optimal Dynamic Treatment Regime - Lu Wang, December 10, 2019

Similar to Causal Inference Opening Workshop - New Statistical Learning Methods for Estimating the Optimal Dynamic Treatment Regime - Lu Wang, December 10, 2019 (20)

More from The Statistical and Applied Mathematical Sciences Institute

More from The Statistical and Applied Mathematical Sciences Institute (20)

Recently uploaded

Recently uploaded (20)

Causal Inference Opening Workshop - New Statistical Learning Methods for Estimating the Optimal Dynamic Treatment Regime - Lu Wang, December 10, 2019