
# Off-policy evaluation

Off-policy evaluation from a causal inference perspective


1. Off-policy evaluation: a survey. Masatoshi Uehara (Harvard University), OPE, December 25, 2019. Disclaimer: this is a very casual note.
2. Overview: (1) motivation; (2) contextual bandit setting (with parametric models); (3) bandit setting (with nonparametric models); (4) RL setting (sequential or longitudinal setting); (5) open problems (general DAGs, mediation, interference).
3. Off-policy evaluation (OPE). The goal is to evaluate the value of a policy from historical data. More formally: estimate the value of the evaluation policy $\pi_e$ from data obtained under the behavior policy $\pi_b$.
4. Some notation from semiparametric theory. See van der Vaart (1998), Bickel et al. (1998), Tsiatis (2006), and Kennedy (2016).
   - Semiparametric model: a combination of parametric and nonparametric components.
   - Semiparametric efficiency bound: the extension of the Cramer-Rao lower bound from parametric to semiparametric models.
   - Influence function (IF) of an estimator $\hat\theta$ of $\theta^*$: the function $\phi(x)$ such that $\sqrt{N}(\hat\theta - \theta^*) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\phi(x^{(i)}) + o_p(1)$.
   - Efficient influence function (EIF): the IF of the estimand that minimizes the variance.
   - Efficient estimator: an estimator achieving the efficiency bound.
5. Contextual bandit setting. We observe $\{s^{(i)}, a^{(i)}, r^{(i)}\}_{i=1}^{N} \sim p(s)\,\pi_b(a|s)\,p(r|s,a)$, and we want to estimate
   $E_{\pi_e}[r] = \int r\, p(s)\,\pi_e(a|s)\,p(r|s,a)\, d\mu(r,s,a)$.
   Good surveys: Rotnitzky and Vansteelandt (2014); Seaman and Vansteelandt (2018); Huber (2019); Diaz (2019).
   - Unless otherwise noted, expectations are taken w.r.t. the behavior policy.
   - The extension to the counterfactual setting is straightforward.
   - $E_N[\cdot]$ denotes the empirical approximation.
   - Value functions and Q-functions are defined for the evaluation policy.
6. CB: semiparametric lower bound. The efficiency bound under the nonparametric model is
   $\mathrm{var}\{v(s)\} + E\{\eta(s,a)^2\,\mathrm{var}(r|s,a)\}$,
   where $q(s,a) = E(r|s,a)$, $v(s) = E_{\pi_e}\{q(s,a)\,|\,s\}$, and $\eta(s,a) = \pi_e(a|s)/\pi_b(a|s)$.
   How to obtain it? Approximate the infinite-dimensional model by parametric submodels, then take the supremum of the corresponding Cramer-Rao lower bounds.
7. Implications of the semiparametric lower bound. It is the lower bound on the asymptotic MSE among regular estimators; therefore, for example,
   $\mathrm{var}\{v(s)\} + E\{\eta(s,a)^2\,\mathrm{var}(r|s,a)\} \le \mathrm{var}\{\eta(s,a)\,r\}$.
   Importantly, this lower bound is the same whether or not the behavior policy is known.
8. Common estimators.
   - IS (importance sampling, a.k.a. IPW or Horvitz-Thompson): $E_N[\hat\eta(s,a)\,r]$, where $\eta(s,a) = \pi_e(a|s)/\pi_b(a|s)$.
   - NIS (normalized IS): $E_N[\hat\eta(s,a)\,r]\,/\,E_N[\hat\eta(s,a)]$.
   - DM (direct method): $E_N[\hat v(s)]$, where $q(s,a) = E[r|s,a]$ and $\hat v(s) = E_{\pi_e}[\hat q(s,a)\,|\,s]$.
   - AIS (augmented IS; Robins et al., 1994; Dudik et al., 2014): $E_N[\hat\eta(s,a)(r - \hat q(s,a)) + \hat v(s)]$.
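A minimal numerical sketch of the four estimators above, on a hypothetical two-context, two-action bandit. The policies, reward model, and sample size are illustrative assumptions, and $\eta$ and $q$ are taken as known here to isolate the estimators themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contextual bandit: binary context s, binary action a (illustrative).
N = 50_000
s = rng.integers(0, 2, size=N)                 # contexts ~ p(s)
pi_b = np.where(s == 1, 0.7, 0.3)              # P(a=1|s) under behavior
pi_e = np.where(s == 1, 0.2, 0.9)              # P(a=1|s) under evaluation
a = (rng.random(N) < pi_b).astype(int)         # actions drawn from pi_b
q_true = 0.5 * s + a * (1.0 - s)               # E[r|s,a] for this toy model
r = q_true + rng.normal(0, 0.1, size=N)        # noisy rewards

# Density ratio eta(s,a) = pi_e(a|s) / pi_b(a|s) (behavior policy known here).
p_e = np.where(a == 1, pi_e, 1 - pi_e)
p_b = np.where(a == 1, pi_b, 1 - pi_b)
eta = p_e / p_b

# v(s) = E_{pi_e}[q(s,a)|s] for the toy q above.
v = pi_e * (0.5 * s + (1 - s)) + (1 - pi_e) * (0.5 * s)

is_est = np.mean(eta * r)                      # IS
nis_est = np.mean(eta * r) / np.mean(eta)      # NIS
dm_est = np.mean(v)                            # DM
ais_est = np.mean(eta * (r - q_true) + v)      # AIS

# Analytic true value for this model: s=0 gives v=0.9, s=1 gives v=0.5.
true_value = 0.5 * 0.9 + 0.5 * 0.5             # = 0.7
```

All four agree with the true value here; their differences show up in variance and in robustness once $\hat\eta$ and $\hat q$ must be estimated.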
9. Useful properties of AIS, part 1: model double robustness (in terms of consistency and $\sqrt N$-consistency). The estimator is consistent if either $\hat\eta(s,a) \approx \eta(s,a)$ or $\hat q(s,a) \approx q(s,a)$ holds. (The slide shows a 2x2 table over these two conditions; red marks consistent cells, green marks inconsistent ones.)
10. Useful properties of AIS, part 2: rate double robustness. $\|\hat\eta - \eta\|_2 = o_p(N^{-1/4})$ and $\|\hat q - q\|_2 = o_p(N^{-1/4})$ are sufficient conditions to guarantee efficiency (Chernozhukov et al., 2018; Rotnitzky and Smucler, 2019).
   A fact regarding plug-in: even if nuisance functions are estimated at a parametric $\sqrt N$ rate, plugging them in generally changes the asymptotic variance; here, thanks to the orthogonality of the IF, the asymptotic variance is unchanged under plug-in (Rotnitzky et al., 2019).
11. Doubly robust direct and doubly robust IS estimators.
   - Doubly robust regression estimator (Scharfstein et al., 1999; Kang and Schafer, 2007): learn $q(s,a)$ with covariates including $\hat\eta(s,a)$ (weighted regression), then define the estimator as $E_N[\hat q(s,a)]$. This is doubly robust, and close to TMLE.
   - Doubly robust IS estimator (Robins et al., 2007): learn $\eta(s,a)$ with covariates based on $\hat q(s,a)$, then define the IS estimator $E_N[\hat\eta(s,a)\,r]$. Also doubly robust, and also close to TMLE.
12. More doubly robust (MDR) estimator. Motivation: AIS performs poorly when $q(s,a)$ is misspecified (Rubin and van der Laan, 2008; Cao et al., 2009). MDR minimizes the variance over a class of estimators irrespective of the specification of $q(s,a)$. When the behavior policy is known, the Q-function is estimated as
   $\hat q = \arg\min_{q \in \mathcal F_q}\; [\mathrm{var}\{v(s)\} + E\{\eta(s,a)^2\,\mathrm{var}(r|s,a)\}]$,
   and then plugged into DR. Properties: still doubly robust; can be extended to an unknown behavior policy; extension to RL (Farajtabar et al., 2018).
13. Intrinsically efficient estimator. Motivation: the performance of AIS can become worse than IS or NIS when the q-model is misspecified. Intrinsically efficient estimators (Tan, 2006, 2010) build a class of estimators that includes IS and NIS, and optimize within it to minimize the variance. Properties: still doubly robust, and never worse than IS and NIS. Extension to RL (Kallus and Uehara, 2019c).
14. Bias-reduced estimator (Vermeulen and Vansteelandt, 2015). Motivation: what happens when both models are misspecified? Vermeulen and Vansteelandt (2015) introduced an estimator based on reducing the MSE irrespective of model specification. Properties: doubly robust and robust to model misspecification.
15. Nonparametric IS (Hirano et al., 2003): IS with $\pi_b$ estimated nonparametrically; this achieves the efficiency bound under some smoothness conditions.
   Plug-in paradox (Robins et al., 1992; Henmi and Eguchi, 2004; Henmi et al., 2007): a plug-in estimator based on the MLE can be more efficient than the non-plug-in estimator. If so, is the plug-in IS estimator better than the non-plug-in one? Yes, if the models are well specified (it implicitly uses a control variate; Robins et al., 2007); no, if the models are misspecified.
16. Nonparametric direct method. Hahn (1998) introduced a direct-method estimator with $q(s,a)$ estimated nonparametrically; it achieves the efficiency bound under some smoothness conditions.
   Parametric direct method (a.k.a. the g-formula; Hernan and Robins, 2019): we can also assume a parametric model for $q(s,a)$ directly (semiparametric direct method). The efficiency bound under a parametric q-model is smaller than the efficiency bound under the nonparametric model (Tan, 2007).
17. Double/debiased machine learning (Chernozhukov et al., 2018). The estimator is $E_N[\hat\eta(s,a)(r - \hat q(s,a)) + \hat v(s)]$, where $v(s) = E_{\pi_e}[r|s]$, computed with cross-fitting (a.k.a. sample splitting; van der Vaart, 1998). Both $\eta$ and $q$ are estimated nonparametrically; rate double robustness is attained without Donsker conditions on the nuisance estimators.
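A sketch of cross-fitting on the same kind of toy bandit, assuming tabular (empirical-mean) nuisance estimates stand in for the nonparametric learners; the two-fold scheme, policies, and reward model are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-context, two-action bandit (illustrative).
N = 20_000
s = rng.integers(0, 2, size=N)
pb_true = np.where(s == 1, 0.7, 0.3)     # behavior P(a=1|s), treated as unknown
pe = np.where(s == 1, 0.2, 0.9)          # evaluation P(a=1|s), known
a = (rng.random(N) < pb_true).astype(int)
r = 0.5 * s + a * (1.0 - s) + rng.normal(0, 0.1, size=N)

folds = rng.permutation(N) % 2           # two folds for cross-fitting
psi = np.empty(N)
for k in (0, 1):
    tr, ev = folds != k, folds == k
    # Nuisances fit on the training fold only (tabular means here).
    q = {(x, y): r[tr & (s == x) & (a == y)].mean() for x in (0, 1) for y in (0, 1)}
    pb = {x: a[tr & (s == x)].mean() for x in (0, 1)}
    se_, ae_, re_ = s[ev], a[ev], r[ev]
    # Estimated ratio eta = pi_e(a|s) / pi_b_hat(a|s) on the held-out fold.
    pe_a = np.where(ae_ == 1, pe[ev], 1 - pe[ev])
    pb_a = np.where(ae_ == 1,
                    np.where(se_ == 1, pb[1], pb[0]),
                    np.where(se_ == 1, 1 - pb[1], 1 - pb[0]))
    qh = np.where(se_ == 1,
                  np.where(ae_ == 1, q[(1, 1)], q[(1, 0)]),
                  np.where(ae_ == 1, q[(0, 1)], q[(0, 0)]))
    vh = np.where(se_ == 1,
                  0.2 * q[(1, 1)] + 0.8 * q[(1, 0)],
                  0.9 * q[(0, 1)] + 0.1 * q[(0, 0)])
    # Influence-function term evaluated out of fold.
    psi[ev] = pe_a / pb_a * (re_ - qh) + vh

dml_est = psi.mean()                     # ~ 0.7 for this toy problem
```

The point of the split is that each influence-function term is evaluated on data the nuisances never saw, which is what removes the Donsker-type condition.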
18. TMLE (Rubin, 2006; van der Laan, 2011; Benkeser et al., 2017). TMLE updates the estimator based on the efficient influence function of the target (a super-learner is also used here). When the EIF can be written analytically, TMLE reduces to a one-step estimator; see slide 11. When the EIF does not have a closed form, the update is iterative. See also collaborative double robustness (van der Laan and Gruber, 2010; Diaz, 2018).
19. Other important estimators:
   - switching estimator (Tsiatis and Davidian, 2007; Wang et al., 2017);
   - matching estimator (Abadie and Imbens, 2006; Wang and Zubizarreta, 2019a);
   - covariate balancing with various divergences (Imai and Ratkovic, 2014; Wang and Zubizarreta, 2019b);
   - minimax estimator (Kallus, 2018; Chernozhukov et al., 2018; Hirshberg and Wager, 2019);
   - high-dimensional settings (many, e.g. Farrell, 2015; Smucler et al., 2019);
   - continuous treatments, where the estimand is a difference (Kennedy et al., 2017);
   - finite-population inference (Bojinov and Shephard, 2019);
   - multiple robustness (Rotnitzky et al., 2017).
20. RL setting (application). (Figure: ADHD example, Chakraborty, 2009.)
21. Summary of the RL situation. Table: efficiency bounds and efficient estimators for OPE.

   | Setting | Efficiency bound | Efficient estimator |
   |---|---|---|
   | NMDP | Kallus and Uehara (2019a) | Jiang and Li (2016); Thomas and Brunskill (2016) |
   | TMDP | Kallus and Uehara (2019a) | Kallus and Uehara (2019a) |
   | MDP | Kallus and Uehara (2019b) | Kallus and Uehara (2019b) |

   Jiang and Li (2016) also calculated bounds for NMDP and TMDP in the tabular case. Note that the efficiency bound and estimator under NMDP are essentially given in the causal inference literature (Murphy, 2003; van der Laan and Robins, 2003; Bang and Robins, 2005).
22. MDP. An MDP is $\{S, A, R, p\}$, with state space $S$, action space $A$, and reward space $R$; transition density $p(s'|s,a)$; reward distribution $p(r|s,a)$; initial distribution $p^{(0)}(s_0)$; evaluation policy $\pi_e(a|s)$ and behavior policy $\pi_b(a|s)$. The distribution induced by the MDP and the behavior policy is
   $p(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \dots) = p^{(0)}(s_0)\,\pi_b(a_0|s_0)\,p(r_0|s_0,a_0)\,p(s_1|s_0,a_0)\,\pi_b(a_1|s_1)\,p(r_1|s_1,a_1)\cdots$.
   (Figure: the MDP as a graphical model over $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \dots$)
23. NMDP and TMDP. The MDP can be relaxed in two ways: NMDP (non-Markov decision process, dropping the Markov property) and TMDP (time-varying Markov decision process, dropping time-invariance). (Figures: NMDP and TMDP as graphical models.)
24. Goal in OPE for RL. Estimate
   $\rho^{\pi_e} = (1-\gamma)\sum_{t=0}^{\infty} E_{\pi_e}[\gamma^t r_t], \quad \gamma < 1$,
   where the expectation is taken w.r.t. $p_e^{(0)}(s_0)\,\pi_e(a_0|s_0)\,p(r_0|s_0,a_0)\,p(s_1|s_0,a_0)\,\pi_e(a_1|s_1)\,p(r_1|s_1,a_1)\cdots$. We can use a set of samples $\{s_t^{(i)}, a_t^{(i)}, r_t^{(i)}\}_{i=1,t=0}^{N,T}$ generated by the MDP and the behavior policy $\pi_b$, i.e. following $p_b^{(0)}(s_0)\,\pi_b(a_0|s_0)\,p(r_0|s_0,a_0)\,p(s_1|s_0,a_0)\,\pi_b(a_1|s_1)\,p(r_1|s_1,a_1)\cdots$.
25. Three common approaches.
   - DM (direct method): $\hat\rho_{\mathrm{DM}} = (1-\gamma)\,E_N[E_{\pi_e}[\hat q(s_0,a_0)\,|\,s_0]]$, where $q(s_0,a_0) = E[\sum_{t=0}^{\infty}\gamma^t r_t\,|\,s_0,a_0]$.
   - SIS (sequential importance sampling): $\hat\rho_{\mathrm{SIS}} = (1-\gamma)\,E_N[\sum_{t=0}^{T}\gamma^t \nu_t r_t]$, where $\nu_t = \prod_{k=0}^{t}\eta_k(s_k,a_k)$ and $\eta_k(s_k,a_k) = \pi_e(a_k|s_k)/\pi_b(a_k|s_k)$.
   - DR (doubly robust; Jiang and Li, 2016; Thomas and Brunskill, 2016): $\hat\rho_{\mathrm{DR}} = (1-\gamma)\,E_N[\sum_{t=0}^{T}\gamma^t(\nu_t(r_t - \hat q_t) + \nu_{t-1}\hat v_t(s_t))]$.
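The SIS estimator can be sketched on a deliberately state-less toy MDP (the reward equals the action), which makes the cumulative weights $\nu_t$ easy to see; the horizon, discount, and policies are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# State-less toy MDP: r_t = a_t, actions i.i.d. under each policy (illustrative).
N, T, gamma = 100_000, 3, 0.9
pb, pe = 0.5, 0.8                          # P(a_t = 1) under pi_b / pi_e
a = (rng.random((N, T)) < pb).astype(int)  # N trajectories drawn from pi_b
r = a.astype(float)                        # deterministic reward r_t = a_t

eta = np.where(a == 1, pe / pb, (1 - pe) / (1 - pb))  # per-step ratio eta_k
nu = np.cumprod(eta, axis=1)               # nu_t = prod_{k<=t} eta_k
disc = gamma ** np.arange(T)

rho_sis = (1 - gamma) * np.mean(np.sum(disc * nu * r, axis=1))
# Truncated true value: (1-gamma) * pe * sum_t gamma^t.
rho_true = (1 - gamma) * pe * disc.sum()   # = 0.2168
```

With $T = 3$ the cumulative weights are harmless; the next slides show why growing $T$ makes $\nu_t$, and hence the variance, explode.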
26. Curse of horizon.
   $E_{\pi_e}\!\left[\sum_{t=0}^{T}\gamma^t r_t\right] = E_{\pi_b}\!\left[\prod_{k=0}^{T}\frac{\pi_e(a_k|s_k)}{\pi_b(a_k|s_k)}\sum_{t=0}^{T}\gamma^t r_t\right] = E_{\pi_b}\!\left[\sum_{t=0}^{T}\prod_{k=0}^{t}\frac{\pi_e(a_k|s_k)}{\pi_b(a_k|s_k)}\,\gamma^t r_t\right] = E_{\pi_b}\!\left[\sum_{t=0}^{T}\gamma^t \nu_t r_t\right] \approx E_N\!\left[\sum_{t=0}^{T}\gamma^t \nu_t r_t\right]$.
   Problem: the variance grows exponentially w.r.t. $T$.
27. Curse of horizon (continued). The SIS and DR estimators suffer from the curse of horizon; the DM estimator does not, but it suffers from model misspecification. Q: Are there any solutions? A: The MDP assumptions, namely the Markov property and time-invariance, are not yet fully exploited.
28. Leveraging the Markov property. Xie et al. (2019) proposed a marginal importance sampling estimator:
   $E_{\pi_e}\!\left[\sum_{t=0}^{T}\gamma^t r_t\right] = E_{\pi_b}\!\left[\sum_{t=0}^{T}\gamma^t \mu_t r_t\right] \approx E_N\!\left[\sum_{t=0}^{T}\gamma^t \mu_t r_t\right]$,
   where $\mu_t$ is the marginal density ratio at time $t$: $\mu_t = p_{\pi_e}(s_t,a_t)/p_{\pi_b}(s_t,a_t)$.
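A sketch of this marginal-ratio estimator on a tiny tabular TMDP where the next state equals the action taken and $r(s,a) = 1\{a = s\}$; the dynamics and policies are illustrative assumptions. The marginal state distributions $d_t$ are computed exactly by forward recursion, so $\mu_t$ is exact here (in practice it must be estimated from data):

```python
import numpy as np

rng = np.random.default_rng(3)

T, gamma, N = 5, 0.9, 100_000
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])   # pi_b[s, a]
pi_e = np.array([[0.1, 0.9], [0.8, 0.2]])   # pi_e[s, a]

def state_marginals(pi):
    """Exact d_t(s) under policy pi, starting from s_0 = 0."""
    d = np.zeros((T, 2)); d[0] = [1.0, 0.0]
    for t in range(T - 1):
        d[t + 1] = d[t] @ pi                # next state = action taken
    return d

d_b, d_e = state_marginals(pi_b), state_marginals(pi_e)
num = d_e[:, :, None] * pi_e                # p_{pi_e}(s_t, a_t)
den = np.maximum(d_b[:, :, None] * pi_b, 1e-12)  # guard unvisited cells
mu = num / den                              # marginal ratio mu_t(s, a)

# Roll out N trajectories under pi_b.
s = np.zeros((N, T), dtype=int)
a = np.empty((N, T), dtype=int)
for t in range(T):
    a[:, t] = (rng.random(N) < pi_b[s[:, t], 1]).astype(int)
    if t + 1 < T:
        s[:, t + 1] = a[:, t]               # deterministic transition
r = (a == s).astype(float)

disc = gamma ** np.arange(T)
w = mu[np.arange(T), s, a]                  # mu_t(s_t, a_t) for each step
rho_mis = (1 - gamma) * np.mean(np.sum(disc * w * r, axis=1))
# Exact value from the same recursion: E_{pi_e}[r_t] = sum_s d_e(t,s) pi_e(s,s).
rho_true = (1 - gamma) * np.sum(disc * (d_e * np.diag(pi_e)).sum(axis=1))
```

Unlike $\nu_t$, the weights $\mu_t(s_t, a_t)$ stay bounded as $t$ grows, which is the variance-reduction message of the slide.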
29. Efficiency bounds under NMDP and TMDP.
   Theorem (EB under NMDP): $\mathrm{EB}(M_1) = (1-\gamma)^2 \sum_{k=1}^{\infty} E[\gamma^{2(k-1)}\,\nu_{k-1}^2\,\mathrm{var}(r_{k-1} + v_k \,|\, H_{a_{k-1}})]$.
   Theorem (EB under TMDP): $\mathrm{EB}(M_2) = (1-\gamma)^2 \sum_{k=1}^{\infty} E[\gamma^{2(k-1)}\,\mu_{k-1}^2\,\mathrm{var}(r_{k-1} + v_k \,|\, a_{k-1}, s_{k-1})]$.
   The typical behavior of $\nu_{k-1}^2$ is $O(C^k)$, while that of $\mu_{k-1}^2$ is $O(1)$: under TMDP the variance does not grow exponentially w.r.t. $T$.
30. Double reinforcement learning (for TMDP). Kallus and Uehara (2019a) proposed an estimator (DRL) achieving the efficiency bound under TMDP:
   $\hat\rho_{\mathrm{DRL}}(M_2) = (1-\gamma)\,E_N\!\left[\sum_{t=0}^{T}\gamma^t(\hat\mu_t(r_t - \hat q_t) + \hat\mu_{t-1}\hat v_t(s_t))\right]$.
31. Double robustness of DRL for TMDP: model double robustness (and also rate double robustness) holds; the estimator is consistent if either $\hat\mu_t(s,a) \approx \mu_t(s,a)$ or $\hat q_t(s,a) \approx q_t(s,a)$. (The slide shows a 2x2 table over these two conditions; red marks consistent cells, green marks inconsistent ones.)
32. Curse of horizon, again. Q: Is the curse of horizon solved? A: At least the variance no longer blows up with the horizon; but the rate is still not right under the MDP assumption.
33. The correct rate for OPE under an ergodic MDP. The estimators (MSE) introduced so far have rate $1/N$; however, assuming ergodicity, we can learn the estimand at a $1/(NT)$ rate. Importantly, we can then learn from a single trajectory ($N = 1$, $T \to \infty$).
34. Leveraging time-invariance. Liu et al. (2018) proposed an ergodic importance sampling estimator:
   $\lim_{T\to\infty}(1-\gamma)\,E_{\pi_e}\!\left[\sum_{t=0}^{T}\gamma^t r_t\right] = \int r\, p^{\infty}_{e,\gamma}(s,a)\, d\mu(s,a,r) = \int r\,\frac{p^{\infty}_{e,\gamma}(s,a)}{p^{\infty}_{b}(s,a)}\, p^{\infty}_{b}(s,a)\, d\mu(s,a,r) = E_{\pi_b^{\infty}}[r\,w(s,a)] \approx E_N E_T[r\,w(s,a)] = \frac{1}{N}\frac{1}{T}\sum_{i=1}^{N}\sum_{t=1}^{T} r_t^{(i)}\, w(s_t^{(i)}, a_t^{(i)})$,
   where $p^{\infty}_{e,\gamma}(s,a)$ is the average (discounted) visitation distribution of states and actions, and $w(s,a) = p^{\infty}_{e,\gamma}(s,a)/p^{\infty}_{b}(s,a)$.
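A minimal sketch of this stationary-ratio estimator on the same tiny toy MDP (next state = action, $r(s,a) = 1\{a = s\}$); the dynamics, policies, and trajectory length are illustrative assumptions. The ratio $w(s,a)$ is computed exactly from the known model here, whereas Liu et al. (2018) estimate it from data:

```python
import numpy as np

rng = np.random.default_rng(4)

gamma, T = 0.9, 200_000
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])   # pi_b[s, a]
pi_e = np.array([[0.1, 0.9], [0.8, 0.2]])   # pi_e[s, a]

# Discounted visitation under pi_e: d_gamma = (1-gamma) d0 (I - gamma P_e)^{-1},
# where P_e[s, s'] = pi_e[s, s'] because the next state equals the action.
d0 = np.array([1.0, 0.0])
d_gamma = (1 - gamma) * d0 @ np.linalg.inv(np.eye(2) - gamma * pi_e)
# The behavior chain is uniform, so p_b^inf(s, a) = 1/4 for every (s, a).
w = (d_gamma[:, None] * pi_e) / 0.25        # w(s,a) = p_{e,gamma}^inf / p_b^inf

# One long trajectory under pi_b (the N = 1, T -> infinity regime).
# pi_b is uniform regardless of state, so actions are i.i.d. Bernoulli(1/2)
# and s_{t+1} = a_t.
a_seq = (rng.random(T) < 0.5).astype(int)
s_seq = np.concatenate(([0], a_seq[:-1]))
rho_erg = np.mean(w[s_seq, a_seq] * (a_seq == s_seq))

rho_true = float(d_gamma @ np.diag(pi_e))   # sum_s d_gamma(s) pi_e(s, s)
```

Because the weight depends only on the current $(s, a)$ pair, a single ergodic trajectory averages it out at the $1/(NT)$ rate claimed on the previous slide.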
35. Efficiency bound under an ergodic MDP. The lower bound on the asymptotic MSE, scaled by $NT$, among regular estimators is
   $\mathrm{EB}(M_3) = E_{p_b^{(\infty)}}\!\left[\underbrace{w^2(s,a)}_{\text{distribution mismatch}}\;\underbrace{\{r + \gamma v(s') - q(s,a)\}^2}_{\text{squared Bellman residual}}\right]$.

   Table: comparison of rates.

   | Setting | Rate | Curse of horizon |
   |---|---|---|
   | NMDP | $O(1/N)$ | yes |
   | TMDP | $O(1/N)$ | no, but the rate is still $1/N$ |
   | MDP | $O(1/(NT))$ | no |
36. Efficient estimator under an ergodic MDP. Defining $v(s_0) = E_{\pi_e}[q(s_0,a_0)|s_0]$, the efficient estimator $\hat\rho_{\mathrm{DRL}}(M_3)$ is defined as
   $(1-\gamma)\,E_N E_{p_e^{(0)}}[\hat v(s_0)] + E_N E_T[\hat w(s,a)(r + \gamma\hat v(s') - \hat q(s,a))]$,
   or equivalently
   $E_N E_T[\hat w(s,a)\,r] + (1-\gamma)\,E_N E_{p_e^{(0)}}[\hat v(s_0)] + E_N E_T[\hat w(s,a)(\gamma\hat v(s') - \hat q(s,a))]$.
   In the first form the leading term is the DM estimator, and in the second form it is the IS estimator; the remaining terms act as control variates.
37. Double robustness of DRL for MDP: model double robustness (and also rate double robustness) holds; the estimator is consistent if either $\hat w(s,a) \approx w(s,a)$ or $\hat q(s,a) \approx q(s,a)$. (The slide shows a 2x2 table over these two conditions; red marks consistent cells, green marks inconsistent ones.)
38. General causal DAGs (when all variables are measured). Given a causal DAG (FFRCISTG), the g-formula or an IS estimator gives identification formulas (Hernan and Robins, 2019). How to obtain an efficient estimator? See van der Laan and Robins (2003). The remaining problem is how the estimator can be simplified (Rotnitzky and Smucler, 2019).
39. General causal DAGs (with unmeasured variables). The ID algorithm (Shpitser and Pearl, 2008; Tian, 2008; Shpitser and Sherman, 2018) gives a sufficient and necessary identification formula. Its relation to efficient estimation is still an open problem. (Figure: a DAG with unmeasured confounding.)
40. Mediation effects (pathway effects). A modified ID algorithm (Shpitser and Sherman, 2018) gives a sufficient and necessary identification formula. The efficiency theory is still being constructed (Nabi et al., 2018). (Figure: an edge intervention.)
41. Networks and interference. Some estimation methods and their theory exist (Ogburn et al., 2017). Chain graphs are also useful in the network setting (Ogburn et al., 2018), and an identification formula is given in Sherman and Shpitser (2018). (Figure: a chain graph.) Since the units are not i.i.d., this setting is difficult: for example, estimating from a single network requires ergodicity, while ordinary semiparametric theory assumes i.i.d. data.
42. References I
   - Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267.
   - Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962–973.
   - Benkeser, D., M. Carone, M. J. van der Laan, and P. B. Gilbert (2017). Doubly robust nonparametric inference on the average treatment effect. Biometrika 104, 863–880.
   - Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
   - Bojinov, I. and N. Shephard (2019). Time series experiments and causal estimands: exact randomization tests and trading. Journal of the American Statistical Association.
   - Cao, W., A. A. Tsiatis, and M. Davidian (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–734.
   - Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21, C1–C68.
43. References II
   - Chernozhukov, V., W. Newey, J. Robins, and R. Singh (2018). Double/de-biased machine learning of global and local parameters using regularized Riesz representers. arXiv.
   - Diaz, I. (2018). Doubly robust estimators for the average treatment effect under positivity violations: introducing the e-score. arXiv.
   - Diaz, I. (2019). Machine learning in the estimation of causal effects: targeted minimum loss-based estimation and double/debiased machine learning. Biostatistics.
   - Dudik, M., D. Erhan, J. Langford, and L. Li (2014). Doubly robust policy evaluation and optimization. Statistical Science 29, 485–511.
   - Farajtabar, M., Y. Chow, and M. Ghavamzadeh (2018). More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, 1447–1456.
   - Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics 189, 1–23.
   - Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.
   - Henmi, M. and S. Eguchi (2004). A paradox concerning nuisance parameters and projected estimating functions. Biometrika 91, 929–941.
44. References III
   - Henmi, M., R. Yoshida, and S. Eguchi (2007). Importance sampling via the estimated sampler. Biometrika 94, 985–991.
   - Hernan, M. and J. Robins (2019). Causal Inference. Boca Raton: Chapman & Hall/CRC.
   - Hirano, K., G. Imbens, and G. Ridder (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–1189.
   - Hirshberg, D. and S. Wager (2019). Augmented minimax linear estimation. arXiv.
   - Huber, M. (2019). An introduction to flexible methods for policy evaluation. arXiv.
   - Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B 76, 243–263.
   - Jiang, N. and L. Li (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 652–661.
   - Kallus, N. (2018). Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems 31, pp. 8895–8906.
   - Kallus, N. and M. Uehara (2019a). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526.
45. References IV
   - Kallus, N. and M. Uehara (2019b). Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850.
   - Kallus, N. and M. Uehara (2019c). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems 32, pp. 3320–3329.
   - Kang, J. D. Y. and J. L. Schafer (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22, 523–529.
   - Kennedy, E. (2016). Semiparametric theory and empirical processes in causal inference. arXiv.
   - Kennedy, E. H., Z. Ma, M. D. McHugh, and D. S. Small (2017). Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B 79, 1229–1245.
   - Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems 31, pp. 5356–5366.
   - Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B 65, 331–355.
46. References V
   - Nabi, R., P. Kanki, and I. Shpitser (2018). Estimation of personalized effects associated with causal pathways. In Proceedings of the Conference on Uncertainty in Artificial Intelligence 2018.
   - Ogburn, E., O. Sofrygin, and I. Diaz (2017). Causal inference for social network data. arXiv.
   - Ogburn, E. L., I. Shpitser, and Y. Lee (2018). Causal inference, social networks, and chain graphs. arXiv.
   - Robins, J., M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007). Comment: Performance of double-robust estimators when "inverse probability" weights are highly variable. Statistical Science 22, 544–559.
   - Robins, J. M., S. D. Mark, and W. K. Newey (1992). Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics 48, 479–495.
   - Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
   - Rotnitzky, A., J. Robins, and L. Babino (2017). On the multiply robust estimation of the mean of the g-functional. arXiv preprint arXiv:1705.08582.
47. References VI
   - Rotnitzky, A. and E. Smucler (2019). Efficient adjustment sets for population average treatment effect estimation in non-parametric causal graphical models. arXiv.
   - Rotnitzky, A., E. Smucler, and J. Robins (2019). Characterization of parameters with a mixed bias property. arXiv preprint arXiv:1509.02556.
   - Rotnitzky, A. and S. Vansteelandt (2014). Double-robust methods. In Handbook of Missing Data Methodology, Handbooks of Modern Statistical Methods, pp. 185–212. Chapman and Hall/CRC.
   - Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics 2.
   - Rubin, D. B. and M. J. van der Laan (2008). Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics 4.
   - Scharfstein, D., A. Rotnitzky, and J. M. Robins (1999). Adjusting for nonignorable dropout using semiparametric models. Journal of the American Statistical Association 94, 1096–1146.
   - Seaman, S. R. and S. Vansteelandt (2018). Introduction to double robust methods for incomplete data. Statistical Science 33, 184–197.
48. References VII
   - Sherman, E. and I. Shpitser (2018). Identification and estimation of causal effects from dependent data. In Advances in Neural Information Processing Systems 31, pp. 9424–9435.
   - Shpitser, I. and J. Pearl (2008). Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 1941–1979.
   - Shpitser, I. and E. Sherman (2018). Identification of personalized effects associated with causal pathways. In Proceedings of the Conference on Uncertainty in Artificial Intelligence 2018.
   - Smucler, E., A. Rotnitzky, and J. M. Robins (2019). A unifying approach for doubly-robust ℓ1-regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737.
   - Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association 101, 1619–1637.
   - Tan, Z. (2007). Comment: Understanding OR, PS and DR. Statistical Science 22, 560–568.
   - Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 97, 661–682.
49. References VIII
   - Thomas, P. and E. Brunskill (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 2139–2148.
   - Tian, J. (2008). Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI-08), 554–561.
   - Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. New York, NY: Springer.
   - Tsiatis, A. A. and M. Davidian (2007). Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22, 569–573.
   - van der Laan, M. and S. Gruber (2010). Collaborative double robust targeted maximum likelihood estimation. The International Journal of Biostatistics 6.
   - van der Laan, M. J. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics. New York, NY: Springer.
50. References IX
   - van der Laan, M. J. and J. M. Robins (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics. New York, NY: Springer.
   - van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge, UK: Cambridge University Press.
   - Vermeulen, K. and S. Vansteelandt (2015). Bias-reduced doubly robust estimation. Journal of the American Statistical Association 110, 1024–1036.
   - Wang, Y. and J. Zubizarreta (2019a). Large sample properties of matching for balance. arXiv.
   - Wang, Y. and J. Zubizarreta (2019b). Minimal dispersion approximately balancing weights: Asymptotic properties and practical considerations. arXiv.
   - Wang, Y.-X., A. Agarwal, and M. Dudík (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3589–3597.
   - Xie, T., Y. Ma, and Y.-X. Wang (2019). Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pp. 9665–9675.