
Off policy evaluation

Off policy evaluation from a causal inference perspective



  1. Off policy evaluation -survey-. Masatoshi Uehara (Harvard University), December 25, 2019. Disclaimer: this is a very casual note.
  2. Overview: 1. Motivation; 2. Contextual bandit setting (with parametric models); 3. Bandit setting (with nonparametric models); 4. RL setting (sequential or longitudinal setting); 5. Open problems (general DAG, mediation, interference).
  3. Off policy evaluation (OPE). The goal is to evaluate the value of a policy from historical data. More formally, we estimate the value of the evaluation policy $\pi_e$ from data obtained by the behavior policy $\pi_b$.
  4. Some notation from semiparametric theory. Refer to (van der Vaart, 1998; Bickel et al., 1998; Tsiatis, 2006; Kennedy, 2016). (Semiparametric models)... combination of parametric and nonparametric models. (Semiparametric efficiency bound)... extension of the Cramer-Rao lower bound for parametric models to semiparametric models. (Influence function (IF) of an estimator or estimand)... $\phi(x)$ for $\hat\theta$ or $\theta^*$ such that $\sqrt{N}(\hat\theta - \theta^*) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\phi(x^{(i)}) + o_p(1/\sqrt{N})$. (Efficient influence function (EIF))... the IF of the estimand minimizing the variance. (Efficient estimator)... an estimator achieving the efficiency bound.
  5. Contextual bandit setting. We observe $\{s^{(i)}, a^{(i)}, r^{(i)}\}_{i=1}^{N} \sim p(s)\pi_b(a|s)p(r|s,a)$. We want to estimate $E_{\pi_e}[r] = \int r\, p(s)\pi_e(a|s)p(r|s,a)\, d\mu(r,s,a)$. Good surveys: (Rotnitzky and Vansteelandt, 2014; Seaman and Vansteelandt, 2018; Huber, 2019; Diaz, 2019). Unless otherwise noted, the expectation is taken w.r.t. the behavior policy. The extension to the counterfactual setting is straightforward. $E_N[\cdot]$ denotes the empirical approximation. Value functions and Q-functions are defined for the evaluation policy.
  6. CB: semiparametric lower bound. The efficiency bound under the nonparametric model is $\mathrm{var}\{v(s)\} + E\{\eta(s,a)^2\,\mathrm{var}(r|s,a)\}$, where $q(s,a) = E(r|s,a)$, $v(s) = E_{\pi_e}\{E(r|s,a)\,|\,s\}$, and $\eta(s,a) = \pi_e(a|s)/\pi_b(a|s)$. How to obtain it? Approximate the infinite-dimensional model by a parametric submodel, then calculate the supremum of the Cramer-Rao lower bounds.
  7. Implication of the semiparametric lower bound. The semiparametric lower bound is the lower bound of the asymptotic MSE among regular estimators. Therefore, for example, $\mathrm{var}\{v(s)\} + E\{\eta(s,a)^2\,\mathrm{var}(r|s,a)\} < \mathrm{var}\{\eta(s,a)r\}$. Importantly, this lower bound does not change whether the behavior policy is known or not.
  8. Common estimators. IS (importance sampling, a.k.a. IPW or Horvitz-Thompson): $E_N[\hat\eta(s,a)\,r]$, where $\eta(s,a) = \pi_e(a|s)/\pi_b(a|s)$. NIS (normalized IS): $E_N[\hat\eta(s,a)\,r]\,/\,E_N[\hat\eta(s,a)]$. DM (direct method): $E_N[\hat q(s,a)]$, where $q(s,a) = E[r|s,a]$. AIS (augmented IS; Robins et al., 1994; Dudik et al., 2014): $E_N[\hat\eta(s,a)(r - \hat q(s,a)) + \hat v(s)]$, where $\hat v(s) = E_{\pi_e}[\hat q(s,a)\,|\,s]$.
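As a rough illustration, the four estimators above can be written as a minimal numpy sketch, assuming the user supplies the reward array together with nuisance estimates eta_hat (estimated density ratio), q_hat (outcome model evaluated at the observed (s, a)), and v_hat (q_hat averaged over the evaluation policy given s); all names here are illustrative, not from the slides.

```python
import numpy as np

def is_estimate(eta_hat, r):
    """IS / IPW / Horvitz-Thompson: E_N[eta_hat(s, a) * r]."""
    return np.mean(eta_hat * r)

def nis_estimate(eta_hat, r):
    """Normalized IS: E_N[eta_hat * r] / E_N[eta_hat]."""
    return np.sum(eta_hat * r) / np.sum(eta_hat)

def dm_estimate(q_hat):
    """Direct method: E_N[q_hat(s, a)]."""
    return np.mean(q_hat)

def ais_estimate(eta_hat, r, q_hat, v_hat):
    """Augmented IS (doubly robust): E_N[eta_hat * (r - q_hat) + v_hat]."""
    return np.mean(eta_hat * (r - q_hat) + v_hat)
```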
  9. Useful properties of AIS 1: model double robustness (in terms of consistency and $\sqrt{N}$-consistency). Whether $\hat\eta(s,a) \approx \eta(s,a)$ and whether $\hat q(s,a) \approx q(s,a)$: the estimator remains consistent if either model is correctly specified, and is inconsistent only when both are mis-specified.
  10. Useful properties of AIS 2: rate double robustness. $\|\hat\eta - \eta\|_2 = o_p(N^{-1/4})$ and $\|\hat q - q\|_2 = o_p(N^{-1/4})$ are sufficient conditions to guarantee efficiency (Chernozhukov et al., 2018; Rotnitzky and Smucler, 2019). Fact regarding plug-in: even if nuisance functions are estimated at a parametric $\sqrt{N}$-rate, the asymptotic variance will generally change; thanks to the orthogonality of the IF, the asymptotic variance is not changed even if there is plug-in (Rotnitzky et al., 2019).
  11. Doubly robust IS or doubly robust direct estimator. Doubly robust regression estimator (Scharfstein et al., 1999; Kang and Schafer, 2007): learn $q(s,a)$ with covariates including $\hat\eta(s,a)$ (weighted regression), then define the estimator as $E_N[\hat q(s,a)]$; this is doubly robust and close to TMLE. Doubly robust IS estimator (Robins et al., 2007): learn $\eta(s,a)$ with covariates based on $\hat q(s,a)$, then define the IS estimator $E_N[\hat\eta(s,a)\,r]$; this is also doubly robust and close to TMLE.
  12. More doubly robust (MDR) estimator. Motivation: AIS performs poorly when $q(s,a)$ is mis-specified (Rubin and van der Laan, 2008; Cao et al., 2009). MDR minimizes the variance within a class of estimators irrespective of the specification of the $q$-model. When the behavior policy is known, the Q-function is estimated as $\hat q = \arg\min_{q\in\mathcal F_q}\,\big[\mathrm{var}\{v(s)\} + E\{\eta(s,a)^2\,\mathrm{var}(r|s,a)\}\big]$ and then plugged into the DR estimator. (Property) still doubly robust; can be extended to the case where the behavior policy is unknown. Extension to RL: (Farajtabar et al., 2018).
  13. Intrinsically efficient estimator. Motivation: the performance of AIS can become worse than IS or NIS when the q-model is mis-specified. Intrinsically efficient estimators (Tan, 2006, 2010): construct a class of estimators that includes IS and NIS, and optimize within it so that the variance is minimized. (Property) still doubly robust and better than IS and NIS. Extension to RL: (Kallus and Uehara, 2019c).
  14. Bias-reduced estimator (Vermeulen and Vansteelandt, 2015). Motivation: what happens when both models are mis-specified? Vermeulen and Vansteelandt (2015) introduced an estimator based on the idea of reducing the MSE irrespective of model specification. (Property) doubly robust and robust to model misspecification.
  15. Nonparametric IS (Hirano et al., 2003): IS with $\pi_b$ estimated nonparametrically; this achieves the efficiency bound under some smoothness conditions. Plug-in paradox (Robins et al., 1992; Henmi and Eguchi, 2004; Henmi et al., 2007): a plug-in estimator based on the MLE is more efficient than the non-plug-in estimator. If so, is the plug-in IS estimator better than the non-plug-in one? Yes, if the models are well specified (it implicitly uses a control variate; Robins et al., 2007); no, if the models are mis-specified.
  16. Nonparametric direct method. Hahn (1998) introduced an estimator based on the direct method with $q(s,a)$ estimated nonparametrically; this achieves the efficiency bound under some smoothness conditions. Parametric direct method, a.k.a. the g-formula (Hernan and Robins, 2019): we can also assume a parametric model for $q(s,a)$ directly (semiparametric direct method). The efficiency bound under a parametric q-model is smaller than the efficiency bound under the nonparametric model (Tan, 2007).
  17. Double/debiased machine learning (Chernozhukov et al., 2018). The estimator is $E_N[\hat\eta(s,a)(r - \hat q(s,a)) + \hat v(s)]$ (with $v(s) = E_{\pi_e}[r\,|\,s]$), combined with cross-fitting, a.k.a. sample splitting (van der Vaart, 1998). Both the density ratio and $q$ are estimated nonparametrically. Rate double robustness is attained without Donsker conditions on the nuisance estimators.
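A minimal cross-fitting sketch of this recipe, assuming user-supplied callables pi_e (evaluation-policy probabilities), fit_q and fit_pi_b (fit the outcome model and the behavior-policy model on a subset and return callables), a finite action set, and numpy arrays for the data; everything named here is an assumption for illustration, not the slides' implementation.

```python
import numpy as np

def cross_fit_dr(s, a, r, pi_e, fit_q, fit_pi_b, K=2, seed=0):
    """Cross-fitted doubly robust estimator (a sketch of the DML recipe).

    pi_e(a, s)     -> evaluation-policy probabilities pi_e(a | s)
    fit_q(s, a, r) -> callable q_hat(s, a)      (outcome model)
    fit_pi_b(s, a) -> callable pi_b_hat(a, s)   (behavior-policy model)
    Nuisances are fit on K-1 folds and evaluated on the held-out fold.
    """
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, K, size=len(r))
    psi = np.zeros(len(r))
    actions = np.unique(a)
    for k in range(K):
        tr, te = folds != k, folds == k
        q_hat = fit_q(s[tr], a[tr], r[tr])
        pi_b_hat = fit_pi_b(s[tr], a[tr])
        eta = pi_e(a[te], s[te]) / pi_b_hat(a[te], s[te])
        # v_hat(s) = sum_a pi_e(a | s) q_hat(s, a), assuming a finite action set
        a_grid = [np.full(te.sum(), act) for act in actions]
        v_hat = sum(pi_e(ag, s[te]) * q_hat(s[te], ag) for ag in a_grid)
        psi[te] = eta * (r[te] - q_hat(s[te], a[te])) + v_hat
    return psi.mean()
```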
  18. TMLE (Rubin, 2006; van der Laan, 2011; Benkeser et al., 2017). TMLE: update the estimator based on the efficient influence function of the target (the super-learner is also used here). When the EIF can be written analytically, TMLE reduces to a one-step estimator (see slide 11). When the EIF does not have a closed form, it becomes an iterative estimator. Collaborative double robustness (van der Laan and Gruber, 2010; Diaz, 2018).
  19. Other important estimators: switching estimator (Tsiatis and Davidian, 2007; Wang et al., 2017); matching estimator (Abadie and Imbens, 2006; Wang and Zubizarreta, 2019a); covariate balancing with various divergences (Imai and Ratkovic, 2014; Wang and Zubizarreta, 2019b); minimax estimators (Kallus, 2018; Chernozhukov et al., 2018; Hirshberg and Wager, 2019); high-dimensional settings (many, e.g. Farrell (2015); Smucler et al. (2019)); continuous treatment, where the estimand is a difference (Kennedy et al., 2017); finite-population inference (Bojinov and Shephard, 2019); multiple robustness (Rotnitzky et al., 2017).
  20. RL setting (application). Figure: ADHD example (Chakraborty, 2009).
  21. Summary of the RL situation. Table: efficiency bounds and estimators for OPE.
      | Model | Efficiency bound | Efficient estimator |
      |-------|------------------|---------------------|
      | NMDP  | Kallus and Uehara (2019a) | Jiang and Li (2016); Thomas and Brunskill (2016) |
      | TMDP  | Kallus and Uehara (2019a) | Kallus and Uehara (2019a) |
      | MDP   | Kallus and Uehara (2019b) | Kallus and Uehara (2019b) |
      Jiang and Li (2016) also calculated the bounds for NMDP and TMDP in the tabular case. Note that the efficiency bound and estimator under NMDP are essentially given in the causal inference literature (Murphy, 2003; van der Laan and Robins, 2003; Bang and Robins, 2005).
  22. MDP. An MDP is $\{S, A, R, p\}$: state space $S$, action space $A$, reward space $R$; transition density $p(s'|s,a)$; reward distribution $p(r|s,a)$; initial distribution $p^{(0)}(s_0)$; evaluation policy $\pi_e(a|s)$ and behavior policy $\pi_b(a|s)$. The distribution induced by the MDP and the behavior policy is $p(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \dots) = p^{(0)}(s_0)\pi_b(a_0|s_0)p(r_0|s_0,a_0)p(s_1|s_0,a_0)\pi_b(a_1|s_1)p(r_1|s_1,a_1)\cdots$. Figure: MDP.
  23. NMDP and TMDP. The MDP can be relaxed in two ways: NMDP (dropping Markovness) and TMDP (dropping time-invariance). Figure: NMDP (Non-Markov Decision Process). Figure: TMDP (Time-varying Markov Decision Process).
  24. Goal in OPE for RL. Goal: estimate $\rho^{\pi_e} = (1-\gamma)\sum_{t=0}^{\infty} E_{\pi_e}[\gamma^t r_t]$, with $\gamma < 1$. Note that this expectation is taken w.r.t. $p_e^{(0)}(s_0)\pi_e(a_0|s_0)p(r_0|s_0,a_0)p(s_1|s_0,a_0)\pi_e(a_1|s_1)p(r_1|s_1,a_1)\cdots$. We can use a set of samples $\{s_t^{(i)}, a_t^{(i)}, r_t^{(i)}\}_{i=1,\,t=0}^{N,\,T}$ generated by the MDP and the behavior policy $\pi_b$, following $p_b^{(0)}(s_0)\pi_b(a_0|s_0)p(r_0|s_0,a_0)p(s_1|s_0,a_0)\pi_b(a_1|s_1)p(r_1|s_1,a_1)\cdots$.
  25. Three common approaches. DM (direct method): $\hat\rho_{DM} = (1-\gamma)\,E_N\big[E_{\pi_e}[\hat q(s_0,a_0)\,|\,s_0]\big]$, where $q(s_0,a_0) = E[\sum_{t=0}^{\infty}\gamma^t r_t \,|\, s_0, a_0]$. SIS (sequential importance sampling): $\hat\rho_{SIS} = (1-\gamma)\,E_N\big[\sum_{t=0}^{T}\gamma^t \nu_t r_t\big]$, where $\nu_t$ (a function of the history up to $a_t$) $= \prod_{k=0}^{t}\eta_k(s_k,a_k)$ and $\eta_k(s_k,a_k) = \pi_e(a_k|s_k)/\pi_b(a_k|s_k)$. DR (doubly robust) estimator (Jiang and Li, 2016; Thomas and Brunskill, 2016): $\hat\rho_{DR} = (1-\gamma)\,E_N\big[\sum_{t=0}^{T}\gamma^t(\nu_t(r_t - \hat q_t) + \nu_{t-1}\hat v_t(s_t))\big]$.
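A minimal numpy sketch of the SIS and DR estimators, assuming trajectories are stored as (N, T) arrays: eta[i, t] holds the per-step ratio pi_e/pi_b, and q_hat, v_hat hold the estimated Q- and value functions evaluated along each trajectory; these array names are my own, not from the slides.

```python
import numpy as np

def sis_estimate(eta, rewards, gamma):
    """Sequential IS: (1 - gamma) * E_N[ sum_t gamma^t * nu_t * r_t ],
    with nu_t the cumulative product of per-step ratios eta_k = pi_e / pi_b."""
    N, T = rewards.shape
    nu = np.cumprod(eta, axis=1)                     # nu[i, t] = prod_{k <= t} eta[i, k]
    disc = gamma ** np.arange(T)
    return (1 - gamma) * np.mean(np.sum(disc * nu * rewards, axis=1))

def dr_estimate(eta, rewards, q_hat, v_hat, gamma):
    """DR estimator: (1 - gamma) * E_N[ sum_t gamma^t (nu_t (r_t - q_t) + nu_{t-1} v_t) ],
    with the convention nu_{-1} = 1."""
    N, T = rewards.shape
    nu = np.cumprod(eta, axis=1)
    nu_prev = np.hstack([np.ones((N, 1)), nu[:, :-1]])
    disc = gamma ** np.arange(T)
    terms = disc * (nu * (rewards - q_hat) + nu_prev * v_hat)
    return (1 - gamma) * np.mean(np.sum(terms, axis=1))
```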
  26. Curse of horizon. $E_{\pi_e}\big[\sum_{t=0}^{T}\gamma^t r_t\big] = E_{\pi_b}\big[\prod_{k=0}^{T}\frac{\pi_e(a_k|s_k)}{\pi_b(a_k|s_k)}\sum_{t=0}^{T}\gamma^t r_t\big] = E_{\pi_b}\big[\sum_{t=0}^{T}\prod_{k=0}^{t}\frac{\pi_e(a_k|s_k)}{\pi_b(a_k|s_k)}\gamma^t r_t\big] = E_{\pi_b}\big[\sum_{t=0}^{T}\gamma^t\nu_t r_t\big] \approx E_N\big[\sum_{t=0}^{T}\gamma^t\nu_t r_t\big]$. Problem: the variance grows exponentially w.r.t. $T$.
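A toy numpy check (my own illustration, not from the slides) of why the cumulative ratio causes trouble: if the per-step ratios are i.i.d. with mean 1 and variance sigma^2, the variance of their product over T steps is (1 + sigma^2)^T - 1, i.e. exponential in T.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.25                                    # per-step variance of pi_e / pi_b
s2 = np.log1p(sigma2)                            # log-normal parameterization with mean 1
for T in (5, 20, 50):
    eta = rng.lognormal(mean=-0.5 * s2, sigma=np.sqrt(s2), size=(100_000, T))
    nu_T = eta.prod(axis=1)                      # cumulative importance ratio over T steps
    print(T, nu_T.var(), (1 + sigma2) ** T - 1)  # empirical vs. theoretical variance
```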
  27. Curse of horizon. The SIS and DR estimators suffer from the curse of horizon; the DM estimator does not, but it suffers from model misspecification. Q: are there any solutions? A: the MDP assumptions are not yet fully exploited: the Markov assumption and the time-invariance assumption.
  28. Leveraging Markovness. Xie et al. (2019) proposed a marginalized importance sampling estimator: $E_{\pi_e}\big[\sum_{t=0}^{T}\gamma^t r_t\big] = E_{\pi_b}\big[\sum_{t=0}^{T}\prod_{k=0}^{t}\frac{\pi_e(a_k|s_k)}{\pi_b(a_k|s_k)}\gamma^t r_t\big] = E_{\pi_b}\big[\sum_{t=0}^{T}\gamma^t\mu_t r_t\big] \approx E_N\big[\sum_{t=0}^{T}\gamma^t\mu_t r_t\big]$. Here $\mu_t$ is the marginal density ratio at time $t$: $\mu_t = p_{\pi_e}(s_t,a_t)/p_{\pi_b}(s_t,a_t)$.
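A minimal sketch of the resulting estimator, assuming the marginal ratios mu[i, t] ≈ p_{pi_e}(s_t, a_t) / p_{pi_b}(s_t, a_t) have already been estimated (the estimation procedure of Xie et al. is not reproduced here):

```python
import numpy as np

def marginalized_is_estimate(mu, rewards, gamma):
    """Marginalized IS: E_N[ sum_t gamma^t * mu_t * r_t ], replacing the cumulative
    ratio nu_t by the marginal state-action density ratio mu_t.
    Multiply by (1 - gamma) to obtain the normalized value rho defined on slide 24."""
    T = rewards.shape[1]
    disc = gamma ** np.arange(T)
    return np.mean(np.sum(disc * mu * rewards, axis=1))
```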
  29. Efficiency bounds under NMDP and TMDP. Theorem (EB under NMDP): $EB(M_1) = (1-\gamma)^2\sum_{k=1}^{\infty} E\big[\gamma^{2(k-1)}\nu_{k-1}^2\,\mathrm{var}(r_{k-1} + v_k \,|\, \mathcal H_{a_{k-1}})\big]$. Theorem (EB under TMDP): $EB(M_2) = (1-\gamma)^2\sum_{k=1}^{\infty} E\big[\gamma^{2(k-1)}\mu_{k-1}^2\,\mathrm{var}(r_{k-1} + v_k \,|\, a_{k-1}, s_{k-1})\big]$. The typical behavior of $\nu_{k-1}^2$ is $O(C^k)$, while the typical behavior of $\mu_{k-1}^2$ is $O(1)$; hence the variance does not grow exponentially w.r.t. $T$ under TMDP.
  30. Double reinforcement learning (for TMDP). Kallus and Uehara (2019a) proposed an estimator (DRL) achieving the efficiency bound under TMDP: $\hat\rho_{DRL}(M_2) = (1-\gamma)\,E_N\big[\sum_{t=0}^{T}\gamma^t(\hat\mu_t(r_t - \hat q_t) + \hat\mu_{t-1}\hat v_t(s_t))\big]$.
  31. Double robustness of DRL for TMDP. Model double robustness (and rate double robustness) holds with respect to whether $\hat\mu_t(s,a) \approx \mu_t(s,a)$ and whether $\hat q_t(s,a) \approx q_t(s,a)$: the estimator is consistent as long as either the marginal-ratio model or the Q-function model is correctly specified.
  32. Curse of horizon (again). Q: is the curse of horizon solved? A: at least the variance does not blow up with the horizon, but the rate is still not right under MDP.
  33. Correct rate for OPE under an ergodic MDP. The (MSE) rate of the estimators introduced so far is $1/N$. However, assuming ergodicity, we can learn the estimand at a $1/(NT)$ rate. Importantly, we can learn from a single trajectory ($N = 1$, $T \to \infty$).
  34. Leveraging time-invariance. Liu et al. (2018) proposed an ergodic importance sampling estimator: $\lim_{T\to\infty}(1-\gamma)E_{\pi_e}\big[\sum_{t=0}^{T}\gamma^t r_t\big] = \int r\, p^{\infty}_{e,\gamma}(s,a)\,d\mu(s,a,r) = \int r\,\frac{p^{\infty}_{e,\gamma}(s,a)}{p^{\infty}_{b}(s,a)}\,p^{\infty}_{b}(s,a)\,d\mu(s,a,r) = E_{\pi_b^{\infty}}[r\,w(s,a)] \approx E_N E_T[r\,w(s,a)] = \frac{1}{N}\frac{1}{T}\sum_{i=1}^{N}\sum_{t=1}^{T} r_t^{(i)} w(s_t^{(i)}, a_t^{(i)})$, where $p^{\infty}_{e,\gamma}(s,a)$ is the average visitation distribution of state and action, and $w(s,a) = p^{\infty}_{e,\gamma}(s,a)/p^{\infty}_{b}(s,a)$.
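A minimal sketch of this estimator, assuming an estimated stationary density ratio is supplied as an array w[i, t] = w_hat(s_t^(i), a_t^(i)) (the minimax estimation of w itself is not shown); the optional self-normalization is my own practical addition in the spirit of NIS on slide 8, not part of the slide.

```python
import numpy as np

def ergodic_is_estimate(w, rewards, normalize=False):
    """Ergodic IS of Liu et al. (2018):
    (1 / (N T)) * sum_{i, t} r_t^(i) * w_hat(s_t^(i), a_t^(i))."""
    if normalize:                  # optional self-normalization of the weights
        w = w / w.mean()
    return np.mean(w * rewards)
```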
  35. Efficiency bound under an ergodic MDP. The lower bound of the asymptotic MSE, scaled by $NT$, among regular estimators is $EB(M_3) = E_{p_b^{(\infty)}}\big[w^2(s,a)\,\{r + \gamma v(s') - q(s,a)\}^2\big]$, where $w^2(s,a)$ reflects the distribution mismatch and the second factor is the squared Bellman residual. Table: comparison of rates.
      | Model | Rate | Curse of horizon |
      |-------|------|------------------|
      | NMDP  | O(1/N)    | Yes |
      | TMDP  | O(1/N)    | No, but the rate is still 1/N |
      | MDP   | O(1/(NT)) | No |
  36. Efficient estimator under an ergodic MDP. Defining $v(s_0) = E_{\pi_e}[q(s_0,a_0)\,|\,s_0]$, the efficient estimator $\hat\rho_{DRL}(M_3)$ is defined as $(1-\gamma)\,E_N E_{p_e^{(0)}}[\hat v(s_0)] + E_N E_T\big[\hat w(s,a)(r + \gamma\hat v(s') - \hat q(s,a))\big]$, or equivalently $E_N E_T[\hat w(s,a)\,r] + (1-\gamma)\,E_N E_{p_e^{(0)}}[\hat v(s_0)] + E_N E_T\big[\hat w(s,a)(\gamma\hat v(s') - \hat q(s,a))\big]$. The first term of each form corresponds to the DM estimator or the IS estimator, respectively, and the remaining terms act as control variates.
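A minimal sketch of the first form above, assuming all observed transitions (s, a, r, s') are stacked into flat arrays, with w, q, v_next the estimated w_hat(s,a), q_hat(s,a), v_hat(s') on those transitions and v0 the estimates v_hat(s_0) at initial states drawn from p_e^(0); the array layout is my own choice for illustration.

```python
import numpy as np

def drl_mdp_estimate(v0, w, r, v_next, q, gamma):
    """DRL under the ergodic MDP:
    (1 - gamma) * mean[v_hat(s0)] + mean[w_hat(s, a) * (r + gamma * v_hat(s') - q_hat(s, a))]."""
    dm_term = (1 - gamma) * np.mean(v0)                 # direct-method term
    control = np.mean(w * (r + gamma * v_next - q))     # weighted Bellman-residual correction
    return dm_term + control
```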
  37. Double robustness of DRL for MDP. Model double robustness (and rate double robustness) holds with respect to whether $\hat w(s,a) \approx w(s,a)$ and whether $\hat q(s,a) \approx q(s,a)$: the estimator is consistent as long as either the $w$-model or the $q$-model is correctly specified.
  38. General causal DAG (when all variables are measured). Given a causal DAG (FFRCISTG), the g-formula or IS estimator gives an identification formula (Hernan and Robins, 2019). How to obtain an efficient estimator? See van der Laan and Robins (2003). The remaining problem is how the estimator can be simplified (Rotnitzky and Smucler, 2019).
  39. General causal DAG (with unmeasured variables). The ID algorithm (Shpitser and Pearl, 2008; Tian, 2008; Shpitser and Sherman, 2018) gives a sufficient and necessary identification formula. Its relation with efficient estimation is still an open problem. Figure: with unmeasured confounding.
  40. Mediation effect (pathway effect). The modified ID algorithm (Shpitser and Sherman, 2018) gives a sufficient and necessary identification formula. The efficiency theory is still being constructed (Nabi et al., 2018). Figure: edge intervention.
  41. Network and interference. Some estimation methods and theory exist (Ogburn et al., 2017). Chain graphs are also useful for the network setting (Ogburn et al., 2018). Figure: chain graph. An identification formula is given in Sherman and Shpitser (2018). Since the units are not i.i.d., estimation is difficult: e.g., to estimate from a single network, ergodicity is needed, and ordinary semiparametric theory assumes i.i.d. data.
  42. Ref I
      Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267.
      Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962–973.
      Benkeser, D., M. Carone, M. J. van der Laan, and P. B. Gilbert (2017). Doubly robust nonparametric inference on the average treatment effect. Biometrika 104, 863–880.
      Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
      Bojinov, I. and N. Shephard (2019). Time series experiments and causal estimands: exact randomization tests and trading. Journal of the American Statistical Association.
      Cao, W., A. A. Tsiatis, and M. Davidian (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–734.
      Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21, C1–C68.
  43. Ref II
      Chernozhukov, V., W. Newey, J. Robins, and R. Singh (2018). Double/de-biased machine learning of global and local parameters using regularized Riesz representers. arXiv.org.
      Diaz, I. (2018). Doubly robust estimators for the average treatment effect under positivity violations: introducing the e-score. arXiv.org.
      Diaz, I. (2019). Machine learning in the estimation of causal effects: targeted minimum loss-based estimation and double/debiased machine learning. Biostatistics.
      Dudik, M., D. Erhan, J. Langford, and L. Li (2014). Doubly robust policy evaluation and optimization. Statistical Science 29, 485–511.
      Farajtabar, M., Y. Chow, and M. Ghavamzadeh (2018). More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, 1447–1456.
      Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics 189, 1–23.
      Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.
      Henmi, M. and S. Eguchi (2004). A paradox concerning nuisance parameters and projected estimating functions. Biometrika 91, 929–941.
  44. Ref III
      Henmi, M., R. Yoshida, and S. Eguchi (2007). Importance sampling via the estimated sampler. Biometrika 94, 985–991.
      Hernan, M. and J. Robins (2019). Causal Inference. Boca Raton: Chapman & Hall/CRC.
      Hirano, K., G. Imbens, and G. Ridder (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–1189.
      Hirshberg, D. and S. Wager (2019). Augmented minimax linear estimation. arXiv.org.
      Huber, M. (2019). An introduction to flexible methods for policy evaluation. arXiv.org.
      Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. J. R. Statist. Soc. B 76, 243–263.
      Jiang, N. and L. Li (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 652–661.
      Kallus, N. (2018). Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems 31, pp. 8895–8906.
      Kallus, N. and M. Uehara (2019a). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526.
  45. Ref IV
      Kallus, N. and M. Uehara (2019b). Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850.
      Kallus, N. and M. Uehara (2019c). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems 32, pp. 3320–3329.
      Kang, J. D. Y. and J. L. Schafer (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22, 523–529.
      Kennedy, E. (2016). Semiparametric theory and empirical processes in causal inference. arXiv.org.
      Kennedy, E. H., Z. Ma, M. D. McHugh, and D. S. Small (2017). Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 1229–1245.
      Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems 31, pp. 5356–5366.
      Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 331–355.
  46. Ref V
      Nabi, R., P. Kanki, and I. Shpitser (2018). Estimation of personalized effects associated with causal pathways. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2018).
      Ogburn, E., O. Sofrygin, and I. Diaz (2017). Causal inference for social network data. arXiv.org.
      Ogburn, E. L., I. Shpitser, and Y. Lee (2018). Causal inference, social networks, and chain graphs. arXiv.org.
      Robins, J., M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007). Comment: Performance of double-robust estimators when "inverse probability" weights are highly variable. Statistical Science 22, 544–559.
      Robins, J. M., S. D. Mark, and W. K. Newey (1992). Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics 48, 479–495.
      Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
      Rotnitzky, A., J. Robins, and L. Babino (2017). On the multiply robust estimation of the mean of the g-functional. arXiv preprint arXiv:1705.08582.
  47. Ref VI
      Rotnitzky, A. and E. Smucler (2019). Efficient adjustment sets for population average treatment effect estimation in non-parametric causal graphical models. arXiv.org.
      Rotnitzky, A., E. Smucler, and J. Robins (2019). Characterization of parameters with a mixed bias property. arXiv preprint arXiv:1509.02556.
      Rotnitzky, A. and S. Vansteelandt (2014). Double-robust methods. In Handbook of Missing Data Methodology, Handbooks of Modern Statistical Methods, pp. 185–212. Chapman and Hall/CRC.
      Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics 2, 1043–1043.
      Rubin, D. B. and M. J. van der Laan (2008). Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics 4.
      Scharfstein, D., A. Rotnitzky, and J. M. Robins (1999). Adjusting for nonignorable dropout using semi-parametric models. Journal of the American Statistical Association 94, 1096–1146.
      Seaman, S. R. and S. Vansteelandt (2018). Introduction to double robust methods for incomplete data. Statistical Science 33, 184–197.
  48. Ref VII
      Sherman, E. and I. Shpitser (2018). Identification and estimation of causal effects from dependent data. In Advances in Neural Information Processing Systems 31, pp. 9424–9435.
      Shpitser, I. and J. Pearl (2008). Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 1941–1979.
      Shpitser, I. and E. Sherman (2018). Identification of personalized effects associated with causal pathways. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2018).
      Smucler, E., A. Rotnitzky, and J. M. Robins (2019). A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737.
      Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association 101, 1619–1637.
      Tan, Z. (2007). Comment: Understanding OR, PS and DR. Statistical Science 22, 560–568.
      Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 97, 661–682.
  49. Ref VIII
      Thomas, P. and E. Brunskill (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 2139–2148.
      Tian, J. (2008). Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), 554–561.
      Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. New York, NY: Springer.
      Tsiatis, A. A. and M. Davidian (2007). Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22, 569–573.
      van der Laan, M. and S. Gruber (2010). Collaborative double robust targeted maximum likelihood estimation. International Journal of Biostatistics 6, 1181–1181.
      van der Laan, M. J. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data (1st ed.). Springer Series in Statistics. New York, NY: Springer.
  50. Ref IX
      van der Laan, M. J. and J. M. Robins (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics. New York, NY: Springer.
      van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge, UK: Cambridge University Press.
      Vermeulen, K. and S. Vansteelandt (2015). Bias-reduced doubly robust estimation. Journal of the American Statistical Association 110, 1024–1036.
      Wang, Y. and J. Zubizarreta (2019a). Large sample properties of matching for balance. arXiv.org.
      Wang, Y. and J. Zubizarreta (2019b). Minimal dispersion approximately balancing weights: Asymptotic properties and practical considerations. arXiv.org.
      Wang, Y.-X., A. Agarwal, and M. Dudík (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3589–3597.
      Xie, T., Y. Ma, and Y.-X. Wang (2019). Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pp. 9665–9675.
