# Off policy learning

Off policy learning from a causal inference perspective

### Off policy learning

1. Off-policy learning (survey). Masatoshi Uehara (Harvard University), December 25, 2019. Disclaimer: this is a very casual note.
2. Overview: (1) Motivation, (2) Off-policy learning, (3) Open questions.
3. Goal. Off-policy evaluation (OPE): the goal is to evaluate the value of a policy from historical data. Off-policy learning (a.k.a. welfare maximization, counterfactual learning) (Murphy, 2003): the goal is to find the policy that maximizes the value. Applications: advertisement, medical treatment. Good surveys: (statistics perspective) Chakraborty and Moodie (2013); Kosorok and Moodie (2015); Kosorok and Laber (2019); (RL perspective) Szepesvari (2010).
4. Formal setting. The goal is to find, among $\pi \in \Pi$, an optimal policy maximizing the value $\mathbb{E}[Y(\pi(X))] = \int y \, p(y \mid a, x) \, \pi(a \mid x) \, p(x) \, d\mu(y, a, x)$. When $\Pi$ is a large (unrestricted) class, the optimal policy is $\pi^*(X) = \arg\max_{a} \mathbb{E}[Y \mid A = a, X]$. In the off-policy setting, the available data are $(X^{(i)}, Y^{(i)}, A^{(i)})_{i=1}^{n} \sim p(Y \mid A, X) \, \pi_b(A \mid X) \, p(X)$, where $\pi_b$ is the behavior policy that generated the data. Note: the goal is essentially the same as in RL.
5. Regret of the estimator. The regret of the estimator $\hat\pi(X)$ is defined as $\mathbb{E}[Y(\pi^*(X))] - \mathbb{E}[Y(\hat\pi(X))]$, where $\pi^*$ is an optimal policy, $\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}[Y(\pi(X))]$, and $\hat\pi$ is an estimator. We want to upper bound this regret by a rate $R(\Pi, n)$.
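As a worked illustration of the regret definition, here is a minimal sketch that assumes the conditional mean outcome and the context distribution are known exactly, which is never the case in practice. The table `q_true`, the contexts, and the policies are hypothetical toy values, and the benchmark is the best unrestricted policy.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

# (context, action) -> E[Y | A = action, X = context]; assumed known here.
QTable = Dict[Tuple[Hashable, int], float]


def policy_value(policy: Callable[[Hashable], int], q: QTable,
                 context_probs: Dict[Hashable, float]) -> float:
    """E[Y(pi(X))] for a deterministic policy when Q and p(x) are known."""
    return sum(p * q[(x, policy(x))] for x, p in context_probs.items())


def regret(policy: Callable[[Hashable], int], q: QTable,
           context_probs: Dict[Hashable, float],
           actions: Sequence[int]) -> float:
    """Value of the best unrestricted policy minus the value of `policy`."""
    best = sum(p * max(q[(x, a)] for a in actions)
               for x, p in context_probs.items())
    return best - policy_value(policy, q, context_probs)


# Hypothetical toy problem: two contexts, two actions.
q_true = {("young", 0): 2.0, ("young", 1): 1.0,
          ("old", 0): 0.0, ("old", 1): 3.0}
p_x = {"young": 0.5, "old": 0.5}

always_treat = lambda x: 1
treat_old_only = lambda x: 0 if x == "young" else 1

regret_always_treat = regret(always_treat, q_true, p_x, actions=[0, 1])
regret_treat_old_only = regret(treat_old_only, q_true, p_x, actions=[0, 1])
```

In this toy problem treating only the old context is optimal, so its regret is zero, while always treating loses the gap between the two actions in the young context, weighted by that context's probability.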
6. IPW way. Off-policy learning based on the IPW estimator: $\hat\pi = \arg\max_{\pi \in \Pi} \sum_{i=1}^{n} \frac{\pi(A^{(i)} \mid X^{(i)}) \, Y^{(i)}}{\pi_b(A^{(i)} \mid X^{(i)})}$. Literature: outcome weighted learning (biostatistics, statistics; Zhao et al., 2012; Zhang et al., 2013); welfare maximization (economics; Kitagawa and Tetenov, 2018b); counterfactual risk minimization (ML; Swaminathan and Joachims, 2015a,b).
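The IPW objective can be sketched in a few lines. This is a minimal illustration, not any of the cited methods: it assumes deterministic policies (so $\pi(a \mid x)$ is an indicator), known behavior propensities, and a small finite policy class searched by enumeration. The data and policy names are hypothetical. The code averages instead of summing, which does not change the arg max.

```python
from typing import Callable, Dict, List, Tuple

# One logged sample: (context x, logged action a, outcome y, pi_b(a | x)).
Sample = Tuple[float, int, float, float]


def ipw_value(policy: Callable[[float], int], data: List[Sample]) -> float:
    """Average of 1{policy(x) == a} * y / pi_b(a | x) over the logged data."""
    total = 0.0
    for x, a, y, p_b in data:
        if policy(x) == a:  # indicator pi(a | x) for a deterministic policy
            total += y / p_b
    return total / len(data)


# Hypothetical logged data from a uniform behavior policy (pi_b = 0.5).
logged: List[Sample] = [
    (-1.0, 0, 1.0, 0.5),
    (-1.0, 1, 0.0, 0.5),
    (1.0, 0, 0.0, 0.5),
    (1.0, 1, 2.0, 0.5),
]

# Hypothetical finite policy class.
policy_class: Dict[str, Callable[[float], int]] = {
    "never_treat": lambda x: 0,
    "always_treat": lambda x: 1,
    "treat_if_x_positive": lambda x: 1 if x > 0 else 0,
}

values = {name: ipw_value(pi, logged) for name, pi in policy_class.items()}
best_name = max(values, key=values.get)
```

Enumeration only works for tiny policy classes. For richer classes the arg max becomes a weighted classification problem.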
7. Regression way (based on CATE). The CATE is built from the conditional mean outcome $Q(A, X) = \mathbb{E}[Y \mid A, X]$ (the CATE itself being a contrast of $Q$ across actions). We can then estimate an optimal policy as $\hat\pi(X) = \arg\max_{a} \hat{Q}(a, X)$. There are many ways to estimate $Q$.
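A minimal sketch of the regression approach, assuming discrete contexts so that $\hat{Q}$ can be estimated by cell means. Any regression method could be substituted. The data and the fallback for unseen cells are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Sample = Tuple[Hashable, int, float]  # (context x, action a, outcome y)


def fit_q_table(data: List[Sample]) -> Dict[Tuple[Hashable, int], float]:
    """Estimate Q(a, x) = E[Y | A = a, X = x] by the mean outcome per cell."""
    sums: Dict[Tuple[Hashable, int], float] = defaultdict(float)
    counts: Dict[Tuple[Hashable, int], int] = defaultdict(int)
    for x, a, y in data:
        sums[(x, a)] += y
        counts[(x, a)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


def greedy_policy(q_hat: Dict[Tuple[Hashable, int], float],
                  actions: Sequence[int]) -> Callable[[Hashable], int]:
    """pi_hat(x) = argmax_a Q_hat(a, x); unseen cells count as -inf (assumption)."""
    def policy(x: Hashable) -> int:
        return max(actions, key=lambda a: q_hat.get((x, a), float("-inf")))
    return policy


# Hypothetical logged data.
logged: List[Sample] = [
    ("young", 0, 1.0), ("young", 0, 3.0), ("young", 1, 1.0),
    ("old", 0, 0.0), ("old", 1, 2.0), ("old", 1, 4.0),
]

q_hat = fit_q_table(logged)
pi_hat = greedy_policy(q_hat, actions=[0, 1])
```

Unlike the IPW approach, this needs no propensities, but the learned policy is only as good as the outcome model.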
8. Efficient policy learning. The estimator based on the efficient influence function (Dudik et al., 2014; Zhang et al., 2016) is defined as $\hat\pi = \arg\max_{\pi \in \Pi} \sum_{i=1}^{n} \left[ \frac{\pi(A^{(i)} \mid X^{(i)}) \, \{Y^{(i)} - \hat{Q}(A^{(i)}, X^{(i)})\}}{\pi_b(A^{(i)} \mid X^{(i)})} + \sum_{a \in \mathcal{A}} \pi(a \mid X^{(i)}) \, \hat{Q}(a, X^{(i)}) \right]$. Athey and Wager (2017) and Zhou et al. (2018) generalized this result with a tight regret guarantee. The regret is $O_p\left(\sqrt{\mathrm{VC}(\Pi) \, \sup_{\pi} V(\pi) / n}\right)$, where $V(\pi)$ corresponds to a variance term and $\mathrm{VC}(\Pi)$ is the VC dimension of $\Pi$.
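The doubly robust score can be sketched as follows for a deterministic policy. This is an illustration under assumptions, not the cited estimators: the propensities are taken as known, the outcome models `q_zero` and `q_const` are hypothetical and deliberately crude, and cross-fitting of nuisance estimates is omitted.

```python
from typing import Callable, List, Tuple

# One logged sample: (context x, logged action a, outcome y, pi_b(a | x)).
Sample = Tuple[float, int, float, float]


def dr_value(policy: Callable[[float], int], data: List[Sample],
             q_hat: Callable[[int, float], float]) -> float:
    """Average doubly robust score of a deterministic policy."""
    total = 0.0
    for x, a, y, p_b in data:
        chosen = policy(x)
        # Residual correction applies only when the logged action matches.
        correction = (y - q_hat(a, x)) / p_b if chosen == a else 0.0
        # Direct-method term: model prediction for the policy's action.
        total += correction + q_hat(chosen, x)
    return total / len(data)


# Hypothetical logged data from a uniform behavior policy (pi_b = 0.5).
logged: List[Sample] = [
    (-1.0, 0, 1.0, 0.5),
    (-1.0, 1, 0.0, 0.5),
    (1.0, 0, 0.0, 0.5),
    (1.0, 1, 2.0, 0.5),
]

treat_if_x_positive = lambda x: 1 if x > 0 else 0

q_zero = lambda a, x: 0.0   # no outcome model: reduces to plain IPW
q_const = lambda a, x: 1.0  # deliberately misspecified outcome model

value_q_zero = dr_value(treat_if_x_positive, logged, q_zero)
value_q_const = dr_value(treat_if_x_positive, logged, q_const)
```

Both outcome models give the same estimate here because the propensities are correct and the toy data are exactly balanced. With finite noisy samples the estimates would differ, and a better outcome model typically lowers the variance.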
9. Policy learning with fast rate (Luedtke and Chambaz, 2017). The aforementioned regret is $O_p(\sqrt{1/n})$. In fact, it is known that the regret is $o_p(\sqrt{1/n})$ under some mild assumptions (margin conditions; Koltchinskii, 2006). If so, is it still meaningful to consider efficiency (the variance term)? Yes: it would be related to a higher-order term.
10. Policy learning with continuous treatment. The IPW way may be ill-posed when $A$ is continuous. Kallus and Zhou (2018b) use the following transformation: $\sum_{i=1}^{n} \frac{\pi(A^{(i)} \mid X^{(i)}) \, Y^{(i)}}{\pi_b(A^{(i)} \mid X^{(i)})} = \sum_{i=1}^{n} \frac{\int \pi(a \mid X^{(i)}) \, I(A^{(i)} = a) \, d\mu(a) \; Y^{(i)}}{\pi_b(A^{(i)} \mid X^{(i)})} \approx \sum_{i=1}^{n} \frac{\int \pi(a \mid X^{(i)}) \, K_h(A^{(i)}, a) \, d\mu(a) \; Y^{(i)}}{\pi_b(A^{(i)} \mid X^{(i)})}$, where $K_h$ is a kernel with bandwidth $h$. Note: this is the same as adding noise to the class of policies $\pi \in \Pi$ (Foster and Syrgkanis, 2019).
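For a deterministic dose policy, the kernel approximation replaces the indicator $I(A = \pi(X))$ with $K_h(A, \pi(X))$. The sketch below assumes a boxcar (uniform) kernel, chosen only so the arithmetic is easy to follow, and a known behavior density. The data, bandwidth, and policies are hypothetical.

```python
from typing import Callable, List, Tuple

# One logged sample: (context x, logged dose a, outcome y, behavior density at a).
Sample = Tuple[float, float, float, float]


def boxcar_kernel(u: float, h: float) -> float:
    """K_h(u) = K(u / h) / h with K uniform on [-1, 1]."""
    return 0.5 / h if abs(u) <= h else 0.0


def kernel_ipw_value(policy: Callable[[float], float], data: List[Sample],
                     h: float) -> float:
    """Kernel-smoothed IPW value of a deterministic dose policy."""
    total = 0.0
    for x, a, y, density_b in data:
        total += boxcar_kernel(a - policy(x), h) * y / density_b
    return total / len(data)


# Hypothetical data: doses logged uniformly on [0, 1], so the density is 1.
logged: List[Sample] = [
    (0.0, 0.10, 1.0, 1.0),
    (0.0, 0.90, 0.0, 1.0),
    (1.0, 0.20, 0.0, 1.0),
    (1.0, 0.80, 2.0, 1.0),
]

low_then_high = lambda x: 0.1 if x < 0.5 else 0.8
constant_mid = lambda x: 0.5

value_low_then_high = kernel_ipw_value(low_then_high, logged, h=0.25)
value_constant_mid = kernel_ipw_value(constant_mid, logged, h=0.25)
```

The bandwidth trades bias against variance: a small `h` keeps only samples whose logged dose is very close to the policy's dose, while a large `h` borrows from doses the policy would not actually give.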
11. Policy learning with surrogate loss functions. The indicator loss is non-convex, so the optimization is difficult. Since policy learning is an IPW ERM, replacing the indicator loss with a convex surrogate loss function makes the optimization easier (Qian and Murphy, 2011; Zhao et al., 2019).
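To make the surrogate idea concrete, the sketch below compares the weighted 0-1 risk with a hinge surrogate for a one-parameter linear rule. It assumes binary actions coded as -1/+1, nonnegative outcomes (so the weights $Y / \pi_b$ are nonnegative), and known propensities. The hinge loss is one common convex choice and is not necessarily the loss used in the cited papers.

```python
from typing import List, Tuple

# One logged sample: (context x, action a in {-1, +1}, outcome y >= 0, pi_b(a | x)).
Sample = Tuple[float, int, float, float]


def weighted_risks(theta: float, data: List[Sample]) -> Tuple[float, float]:
    """Return (weighted 0-1 risk, weighted hinge risk) of the rule sign(theta * x)."""
    zero_one = 0.0
    hinge = 0.0
    for x, a, y, p_b in data:
        weight = y / p_b          # IPW weight on this sample
        margin = a * theta * x    # positive when the rule agrees with a
        zero_one += weight * (1.0 if margin <= 0 else 0.0)
        hinge += weight * max(0.0, 1.0 - margin)
    n = len(data)
    return zero_one / n, hinge / n


# Hypothetical logged data from a uniform behavior policy (pi_b = 0.5).
logged: List[Sample] = [
    (-1.0, -1, 1.0, 0.5),
    (-1.0, 1, 0.0, 0.5),
    (1.0, -1, 0.0, 0.5),
    (1.0, 1, 2.0, 0.5),
]

candidates = [-1.0, 0.5, 1.0]
risks = {theta: weighted_risks(theta, logged) for theta in candidates}
# Pick the candidate with the smallest hinge (surrogate) risk.
best_theta = min(candidates, key=lambda theta: risks[theta][1])
```

Minimizing the weighted 0-1 risk is equivalent to maximizing the IPW value, and the hinge risk upper-bounds it while being convex in `theta`, so gradient-based solvers can replace the grid used here.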
12. Orthogonal statistical learning (Foster and Syrgkanis, 2019). Consider a general ERM problem with some plug-in estimator. If the original ERM problem has an orthogonal structure, then there is no plug-in effect (roughly, it is enough for the plug-in estimator to converge at half the rate of the original problem). Applications: treatment effect estimation (partially linear model) (Chernozhukov et al., 2018); policy learning; domain adaptation (covariate shift). Remark: orthogonality and efficiency are different concepts, though related.
13. Policy learning with IV. Doubly robust estimation based on some treatment effect model (Abadie and Imbens, 2006; Okui et al., 2012; Schulte et al., 2015). Applying it to policy learning (Athey and Wager, 2017).
14. Policy learning for longitudinal data (RL). Assume the data-generating process is $(X_0, A_0, Y_0, X_1, A_1, \dots, Y_T)$. [Figure: NMDP (non-Markov decision process)] [Figure: TMDP (time-varying Markov decision process)] Goal: find a policy maximizing $\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} Y_t\right]$ among $\pi \in \Pi$.
15. Policy learning for longitudinal data (IPW). The natural idea is to use an IPW estimator for OPE. IPW way (Davidian et al., 2015): $\{\hat\pi_t\}_{t=0}^{T} = \arg\max_{\{\pi_t\}} \sum_{i=1}^{n} \sum_{t=0}^{T} \left[ \prod_{k=0}^{t} \frac{\pi_k(A_k^{(i)} \mid X_k^{(i)})}{\pi_{b,k}(A_k^{(i)} \mid X_k^{(i)})} \right] Y_t^{(i)}$.
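The sequential IPW value inside the arg max can be sketched as below, assuming deterministic per-step policies and known behavior propensities. The trajectories are hypothetical toy data with a single context value, and the estimate is averaged over trajectories.

```python
from typing import Callable, List, Sequence, Tuple

# One step of a trajectory: (context x_t, action a_t, reward y_t, pi_b,t(a_t | x_t)).
Step = Tuple[int, int, float, float]


def sequential_ipw_value(policies: Sequence[Callable[[int], int]],
                         trajectories: List[List[Step]]) -> float:
    """Average over trajectories of sum_t (prod_{k<=t} pi_k / pi_b,k) * y_t."""
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0  # running product of importance ratios
        for t, (x, a, y, p_b) in enumerate(trajectory):
            indicator = 1.0 if policies[t](x) == a else 0.0
            weight *= indicator / p_b
            total += weight * y
    return total / len(trajectories)


# Hypothetical two-step data: uniform behavior policy, reward 1 when a = 1.
trajectories: List[List[Step]] = [
    [(0, 1, 1.0, 0.5), (0, 1, 1.0, 0.5)],
    [(0, 1, 1.0, 0.5), (0, 0, 0.0, 0.5)],
    [(0, 0, 0.0, 0.5), (0, 1, 1.0, 0.5)],
    [(0, 0, 0.0, 0.5), (0, 0, 0.0, 0.5)],
]

always_one = [lambda x: 1, lambda x: 1]
estimated_value = sequential_ipw_value(always_one, trajectories)
```

Only the first trajectory follows the target policy at both steps, and it carries weight 4 on its second reward. The surviving weights double with each additional step under this behavior policy, which is the variance problem that grows with the horizon.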
16. Policy learning for longitudinal data (FQI). Note: this corresponds to the direct method (DM). FQI (for finite horizon) is defined as an iterative procedure (Szepesvari and Munos, 2005; Chen and Jiang, 2019): $\hat{Q}_t = \arg\min_{q \in \mathcal{F}_{Q_t}} \sum_{i=1}^{n} \left\{ Y_t^{(i)} + \max_{a} \hat{Q}_{t+1}(X_{t+1}^{(i)}, a) - q(X_t^{(i)}, A_t^{(i)}) \right\}^2$, solved backward in time from $\hat{Q}_{T+1} = 0$ (so that $\hat{Q}_T$ regresses $Y_T$ alone). The following quantity characterizes the inherent difficulty of FQI: $\sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \| Bg - f \|_2$, where $B$ is the Bellman operator for q-functions.
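A minimal tabular sketch of finite-horizon FQI, assuming discrete contexts so that the least-squares step reduces to a cell mean of the regression targets. Here `horizon` is the number of decision steps ($T + 1$ in the notation above), unseen (context, action) cells default to zero as a simplifying assumption, and the trajectories are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Step = Tuple[int, int, float]        # (context x_t, action a_t, reward y_t)
QTable = Dict[Tuple[int, int], float]


def fitted_q_iteration(trajectories: List[List[Step]], actions: Sequence[int],
                       horizon: int) -> List[QTable]:
    """Backward tabular FQI; returns one Q table per time step."""
    q_tables: List[QTable] = [{} for _ in range(horizon)]
    q_next: QTable = {}  # Q beyond the last step is identically zero
    for t in reversed(range(horizon)):
        sums: QTable = defaultdict(float)
        counts: Dict[Tuple[int, int], int] = defaultdict(int)
        for trajectory in trajectories:
            x, a, y = trajectory[t]
            if t + 1 < horizon:
                x_next = trajectory[t + 1][0]
                # Unseen cells default to 0.0 (simplifying assumption).
                future = max(q_next.get((x_next, b), 0.0) for b in actions)
            else:
                future = 0.0
            sums[(x, a)] += y + future   # regression target for this sample
            counts[(x, a)] += 1
        # With a tabular class, the least-squares fit is the mean target per cell.
        q_next = {cell: sums[cell] / counts[cell] for cell in sums}
        q_tables[t] = q_next
    return q_tables


# Hypothetical two-step data with one context value; reward 1 when a = 1.
trajectories: List[List[Step]] = [
    [(0, 1, 1.0), (0, 1, 1.0)],
    [(0, 1, 1.0), (0, 0, 0.0)],
    [(0, 0, 0.0), (0, 1, 1.0)],
    [(0, 0, 0.0), (0, 0, 0.0)],
]

q_tables = fitted_q_iteration(trajectories, actions=[0, 1], horizon=2)
```

The learned policy at each step is the greedy action under that step's table. With a restricted function class in place of the table, the inherent Bellman error above measures how far the regression targets can fall outside the class.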
17. Policy learning for longitudinal data. As in the bandit case, the IPW and DM methods can be extended to a doubly robust score version (Zhang et al., 2013; Nie et al., 2019). However, the regret will grow exponentially with respect to the horizon. Can this problem be relaxed under an MDP?
18. Policy learning with unmeasured confounders. Assume the propensity $\pi_b$ belongs to some weight class $\mathcal{W}$ (missing-not-at-random setting). The idea of Kallus and Zhou (2018a) is to find a robust policy: $\arg\max_{\pi \in \Pi} \min_{\pi_b \in \mathcal{W}} \mathbb{E}\left[ \frac{Y \, \pi(A \mid X)}{\pi_b(A \mid X)} \right]$. The additional issues are the optimization and the regret guarantee.
19. Other estimators or situations. Multiple loggers (stratified sampling) (He et al., 2019); the case where the value function takes a special (though general enough) form and $A$ is continuous (Chernozhukov et al., 2019); A-learning (close to the partially linear model and structural nested mean models) (Robins, 2004; Schulte et al., 2015); off-policy gradient (assuming the policy class is smooth) (Chen et al., 2018); incorporating fairness (Corbett-Davies et al., 2017; Kitagawa and Tetenov, 2018a; Metevier et al., 2019); off-policy learning to rank (Joachims et al., 2017); finite population inference (Imai and Li, 2019).
20. Open questions. Inference (Laber et al., 2014; Andrews et al., 2018; Jiang et al., 2019); efficient off-policy learning for RL, where the nuisance functions are related to the policy itself (Nie et al., 2019); efficient off-policy learning on a general DAG with unmeasured variables; off-policy learning on networks (Viviano, 2019); off-policy learning with partial identification.
21. Ref I. Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267. Andrews, I., T. Kitagawa, and A. McCloskey (2018). Inference on winners. IDEAS Working Paper Series from RePEc. Athey, S. and S. Wager (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896. Chakraborty, B. and E. Moodie (2013). Statistical methods for dynamic treatment regimes. Springer. Chen, J. and N. Jiang (2019). Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Volume 97, pp. 1042–1051. Chen, M., A. Beutel, P. Covington, S. Jain, F. Belletti, and E. Chi (2018). Top-k off-policy correction for a REINFORCE recommender system. WSDM 2018. Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21, C1–C68.
22. Ref II. Chernozhukov, V., M. Demirer, G. Lewis, and V. Syrgkanis (2019). Semi-parametric efficient policy learning with continuous actions. In Advances in Neural Information Processing Systems 32. Corbett-Davies, S., E. Pierson, A. Feller, S. Goel, and A. Huq (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Volume 129685 of KDD '17, pp. 797–806. ACM. Davidian, M., A. Tsiatis, and E. Laber (2015). Value search estimators for optimal dynamic treatment regimes. In Adaptive Treatment Strategies in Practice, pp. 135–155. Society for Industrial and Applied Mathematics. Dudik, M., D. Erhan, J. Langford, and L. Li (2014). Doubly robust policy evaluation and optimization. Statistical Science 29, 485–511. Foster, D. and V. Syrgkanis (2019). Orthogonal statistical learning. arXiv.org. He, L., X. Long, W. Zeng, Y. Zhao, and D. Yin (2019). Off-policy learning for multiple loggers. arXiv.org. Imai, K. and M. Li (2019). Experimental evaluation of individualized treatment rules. arXiv.org.
23. Ref III. Jiang, B., R. Song, J. Li, and D. Zeng (2019). Entropy learning for dynamic treatment regimes. Statistica Sinica 29, 1633–1655. Joachims, T., A. Swaminathan, and T. Schnabel (2017). Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM '17, pp. 781–789. ACM. Kallus, N. and A. Zhou (2018a). Confounding-robust policy improvement. In Advances in Neural Information Processing Systems 31, pp. 9269–9279. Kallus, N. and A. Zhou (2018b). Policy evaluation and optimization with continuous treatments. International Conference on Artificial Intelligence and Statistics, 1243–1251. Kitagawa, T. and A. Tetenov (2018a). Equality-minded treatment choice. IDEAS Working Paper Series from RePEc. Kitagawa, T. and A. Tetenov (2018b). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86, 591–616. Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34, 2593–2656. Kosorok, M. R. and E. B. Laber (2019). Precision medicine. Annual Review of Statistics and Its Application 6, 263–286.
24. Ref IV. Kosorok, M. R. and E. E. M. Moodie (2015). Adaptive Treatment Strategies in Practice. Society for Industrial and Applied Mathematics. Laber, E. B., D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy (2014). Dynamic treatment regimes: technical challenges and applications. Electronic Journal of Statistics 8, 1225–1272. Luedtke, A. and A. Chambaz (2017). Faster rates for policy learning. arXiv.org. Metevier, B., S. Giguere, S. Brockman, A. Kobren, Y. Brun, E. Brunskill, and P. S. Thomas (2019). Offline contextual bandits with high probability fairness guarantees. In Advances in Neural Information Processing Systems 32, pp. 14893–14904. Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 331–355. Nie, X., E. Brunskill, and S. Wager (2019). Learning when-to-treat policies. arXiv.org. Okui, R., D. S. Small, Z. Tan, and J. M. Robins (2012). Doubly robust instrumental variable regression. Statistica Sinica 22, 173–205. Qian, M. and S. Murphy (2011). Performance guarantees for individualized treatment rules. Annals of Statistics 39, 1180.
25. Ref V. Robins, J. (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics: Analysis of Correlated Data. Schulte, P., A. Tsiatis, E. Laber, and M. Davidian (2015). Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science, 29. Swaminathan, A. and T. Joachims (2015a). Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research 16, 1731–1755. Swaminathan, A. and T. Joachims (2015b). The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems 28, pp. 3231–3239. Szepesvari, C. and R. Munos (2005). Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd International Conference on Machine Learning, Volume 119 of ICML '05, pp. 880–887. ACM. Szepesvari, C. (2010). Algorithms for reinforcement learning, Volume 9 of Synthesis digital library of engineering and computer science. San Rafael, Calif.: Morgan & Claypool. Viviano, D. (2019). Policy targeting under network interference. arXiv.org.
26. Ref VI. Zhang, B., A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber (2016). Estimating optimal treatment regimes from a classification perspective. Stat 5, 278–278. Zhang, B., A. A. Tsiatis, E. B. Laber, and M. Davidian (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100, 681–694. Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118. Zhao, Y.-Q., E. B. Laber, Y. Ning, S. Saha, and B. E. Sands (2019). Efficient augmentation and relaxation learning for individualized treatment rules using observational data. Journal of Machine Learning Research 20. Zhou, Z., S. Athey, and S. Wager (2018). Offline multi-action policy learning: Generalization and optimization. arXiv.org.