Off-policy learning

Off-policy learning from a causal inference perspective

  1. Off-policy learning (survey 1)
     Masatoshi Uehara (Harvard University)
     December 25, 2019
     Disclaimer: this is a very casual note.
  2. Overview
     1. Motivation
     2. Off-policy learning
     3. Open questions
  3. Goal
     Off-policy evaluation (OPE): the goal is to evaluate the value of a policy from historical data.
     Off-policy learning (a.k.a. welfare maximization, counterfactual learning; Murphy, 2003): the goal is to find the policy that maximizes the value.
     Applications: advertisement, medical treatment.
     Good surveys: (statistics perspective) Chakraborty and Moodie (2013), Kosorok and Moodie (2015), Kosorok and Laber (2019); (RL perspective) Szepesvari (2010).
  4. Formal setting
     The goal is to find, among π ∈ Π, a policy maximizing the value
         E[Y(π(X))] = ∫ y p(y | a, x) π(a | x) p(x) dµ(y, a, x).
     When Π is a large class, the optimal policy is
         π*(X) = arg max_a E[Y | A = a, X].
     In the off-policy setting, the available data are
         (X^(i), Y^(i), A^(i))_{i=1}^n ~ p(Y | A, X) π_b(A | X) p(X).
     Basically, the goal is almost the same as in RL.
     A toy simulation of this data format is sketched below.
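To make the rest of the note concrete, here is a minimal Python sketch of the logged-data format (X^(i), A^(i), Y^(i)) ~ p(x) π_b(a | x) p(y | a, x). The specific data-generating process (two Gaussian covariates, a logistic behavior policy, a linear outcome model) is purely hypothetical; the later sketches reuse the arrays X, A, Y, pb it produces.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_logged_data(n, noise=0.5):
    """Toy logged-bandit data (X, A, Y) ~ p(x) * pi_b(a|x) * p(y|a,x).

    Everything here (covariate dimension, behavior policy, outcome model)
    is a made-up illustration, only meant to fix the data format.
    """
    X = rng.normal(size=(n, 2))                      # contexts ~ p(x)
    p1 = 1.0 / (1.0 + np.exp(-X[:, 0]))              # pi_b(A=1 | x)
    A = rng.binomial(1, p1)                          # logged actions ~ pi_b
    pb = np.where(A == 1, p1, 1.0 - p1)              # logged propensities pi_b(A | X)
    # outcome model p(y | a, x): action 1 is better when x0 + x1 > 0
    Y = (2 * A - 1) * (X[:, 0] + X[:, 1]) + rng.normal(scale=noise, size=n)
    return X, A, Y, pb

X, A, Y, pb = simulate_logged_data(5000)
```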
  5. Regret of the estimator
     The regret of an estimated policy π̂(X) is defined as
         E[Y(π*(X))] − E[Y(π̂(X))],
     where π*(X) = arg max_{π ∈ Π} E[Y(π(X))] is an optimal policy and π̂(X) is the estimator.
     We want to upper bound this regret by some R(Π, n).
  6. IPW way
     Off-policy learning based on the IPW estimator:
         π̂ = arg max_{π ∈ Π} Σ_{i=1}^n π(A^(i) | X^(i)) Y^(i) / π_b(A^(i) | X^(i)).
     Literature:
     Outcome-weighted learning (biostatistics/statistics: Zhao et al., 2012; Zhang et al., 2013)
     Welfare maximization (economics: Kitagawa and Tetenov, 2018b)
     Counterfactual risk minimization (ML: Swaminathan and Joachims, 2015a,b)
     A minimal sketch of this objective over a toy policy class follows below.
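A minimal sketch of the IPW objective above, assuming the logged arrays X, A, Y, pb from the simulation sketch under slide 4 and a made-up class of threshold policies standing in for Π:

```python
import numpy as np

def ipw_value(pi_probs, Y, pb):
    """IPW estimate of a policy's value; pi_probs[i] = pi(A_i | X_i)."""
    return np.mean(pi_probs * Y / pb)

def ipw_policy_search(X, A, Y, pb, thresholds):
    """Maximize the IPW objective over a toy class of threshold policies
    pi_c(x) = 1{x_0 > c}.  The class is hypothetical; any class Pi for
    which pi(A_i | X_i) can be evaluated would do."""
    best = (None, -np.inf)
    for c in thresholds:
        chosen = (X[:, 0] > c).astype(int)           # deterministic action
        pi_probs = (chosen == A).astype(float)       # pi(A_i | X_i) in {0, 1}
        v = ipw_value(pi_probs, Y, pb)
        if v > best[1]:
            best = (c, v)
    return best

# X, A, Y, pb from the simulation sketch under slide 4
best_c, best_v = ipw_policy_search(X, A, Y, pb, np.linspace(-2, 2, 41))
```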
  7. Regression way (based on the CATE)
     The CATE is defined as Q(A, X) = E[Y | A, X]. Given an estimate Q̂, we obtain a policy as
         π̂(X) = arg max_a Q̂(a, X).
     There are many ways to estimate the CATE; a sketch with an off-the-shelf regressor follows below.
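A sketch of the regression (direct) approach, again using the simulated data from slide 4; the random forest is an arbitrary stand-in for whichever CATE estimator one prefers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def direct_method_policy(X, A, Y, n_actions=2):
    """Direct method: fit Q_hat(a, x) ~ E[Y | A=a, X=x] by regression and
    act greedily.  The regressor choice is illustrative only."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.column_stack([X, A]), Y)            # regress Y on (X, A)
    q_hat = np.column_stack([
        model.predict(np.column_stack([X, np.full(len(X), a)]))
        for a in range(n_actions)
    ])                                               # q_hat[i, a] = Q_hat(a, X_i)
    return model, q_hat, q_hat.argmax(axis=1)        # greedy pi_hat(X_i)

# X, A, Y from the simulation sketch under slide 4
model, q_hat, pi_hat = direct_method_policy(X, A, Y)
```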
  8. Efficient policy learning
     The estimator based on the efficient influence function (Dudik et al., 2014; Zhang et al., 2016) is
         π̂ = arg max_{π ∈ Π} Σ_{i=1}^n [ π(A^(i) | X^(i)) (Y^(i) − Q̂(A^(i), X^(i))) / π_b(A^(i) | X^(i)) + Σ_{a ∈ A} π(a | X^(i)) Q̂(a, X^(i)) ].
     Athey and Wager (2017) and Zhou et al. (2018) generalized this result with a tight regret guarantee. The regret is
         O_p(√(VC(Π) sup_π V(π) / n)),
     where V(π) corresponds to a variance term and VC(Π) is the VC dimension of Π.
     A sketch of the doubly robust value estimate follows below.
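A sketch of the doubly robust value estimate that appears inside the arg max above; the learned policy would maximize this quantity over Π, for example by the same kind of search as in the IPW sketch. The array names are assumptions, and q_hat can come from any CATE estimate such as the regression sketch under slide 7.

```python
import numpy as np

def dr_policy_value(pi_logged, pi_all, A, Y, pb, q_hat):
    """Doubly robust (efficient-score) estimate of a policy's value.

    pi_logged[i]  = pi(A_i | X_i)
    pi_all[i, a]  = pi(a | X_i) for every action a
    q_hat[i, a]   = Q_hat(a, X_i); A holds integer action indices
    """
    n = len(Y)
    q_logged = q_hat[np.arange(n), A]                # Q_hat(A_i, X_i)
    correction = pi_logged * (Y - q_logged) / pb     # weighted residual term
    direct = np.sum(pi_all * q_hat, axis=1)          # sum_a pi(a|X_i) Q_hat(a, X_i)
    return np.mean(correction + direct)
```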
  9. Policy learning with fast rates (Luedtke and Chambaz, 2017)
     The aforementioned regret is O_p(√(1/n)). Actually, it is known that the regret is o_p(√(1/n)) under some mild assumptions (margin conditions; Koltchinskii, 2006).
     If so, is it still meaningful to consider efficiency (the variance term)? Yes; this would be related to a higher-order term.
  10. Policy learning with continuous treatments
     The IPW way may be ill-posed when A is continuous. Kallus and Zhou (2018b) use the following transformation:
         Σ_{i=1}^n π(A^(i) | X^(i)) Y^(i) / π_b(A^(i) | X^(i))
           = Σ_{i=1}^n [∫ π(a | X^(i)) I(A^(i) = a) dµ(a)] Y^(i) / π_b(A^(i) | X^(i))
           ≈ Σ_{i=1}^n [∫ π(a | X^(i)) K_h(A^(i), a) dµ(a)] Y^(i) / π_b(A^(i) | X^(i)).
     This is essentially the same as adding noise to the policies π ∈ Π (Foster and Syrgkanis, 2019).
     A sketch of the kernel-smoothed objective follows below.
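A sketch of the kernel-smoothed objective for a deterministic continuous-action policy; the Gaussian kernel, the bandwidth h, and the assumption that pb holds the behavior density π_b(A^(i) | X^(i)) are all illustrative choices.

```python
import numpy as np

def kernel_ipw_value(pi_of_x, A, Y, pb, h=0.5):
    """Kernel-smoothed IPW value for continuous actions, in the spirit of
    Kallus and Zhou (2018b): the indicator I(A_i = a) is replaced by a
    Gaussian kernel K_h.  The candidate policy is deterministic, with
    recommended action a_i = pi_of_x[i]; pb[i] is the behavior density
    pi_b(A_i | X_i); h is a tuning parameter."""
    K = np.exp(-0.5 * ((A - pi_of_x) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    return np.mean(K * Y / pb)
```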
  11. Policy learning with surrogate loss functions
     The indicator loss is non-convex, so the optimization is difficult. Since policy learning is an IPW-weighted ERM, replacing the indicator loss with a convex surrogate loss makes the optimization easier (Qian and Murphy, 2011; Zhao et al., 2019).
     A sketch via weighted classification with a logistic surrogate follows below.
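One simple instance of this idea, sketched below for binary actions: treat policy learning as weighted classification with a logistic surrogate, in the spirit of outcome-weighted learning (Zhao et al., 2012). The weighting scheme and the use of scikit-learn's LogisticRegression are illustrative choices, not the exact construction in the cited papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def owl_logistic(X, A, Y, pb):
    """Outcome-weighted learning with a convex surrogate: the IPW
    objective with an indicator loss is relaxed to a weighted
    classification problem with a logistic loss.  Assumes binary
    actions; outcomes are shifted to be nonnegative so they can serve
    as sample weights."""
    w = (Y - Y.min()) / pb                     # nonnegative IPW weights
    clf = LogisticRegression()
    clf.fit(X, A, sample_weight=w)             # imitate the logged action,
                                               # weighted by its payoff
    return clf                                 # clf.predict(x) plays the role of pi_hat(x)

# X, A, Y, pb from the simulation sketch under slide 4
pi_hat = owl_logistic(X, A, Y, pb)
```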
  12. Orthogonal statistical learning (Foster and Syrgkanis, 2019)
     Consider a general ERM problem with some plug-in (nuisance) estimator. If the original ERM problem has an orthogonal structure, then there is no first-order plug-in effect. (Basically, it is fine if the plug-in estimator converges at half the rate of the original problem.)
     Applications:
     Treatment effect estimation (partially linear model) (Chernozhukov et al., 2018)
     Policy learning
     Domain adaptation (covariate shift)
     Remark: orthogonality and efficiency are different, though related, concepts.
  13. Policy learning with IV
     Doubly robust estimation based on some treatment effect model (Abadie and Imbens, 2006; Okui et al., 2012; Schulte et al., 2015).
     Applying it to policy learning (Athey and Wager, 2017).
  14. Policy learning for longitudinal data (RL)
     Assume the data-generating process is (X_0, A_0, Y_0, X_1, A_1, ..., Y_T).
     [Figure: NMDP (Non-Markov Decision Process)]
     [Figure: TMDP (Time-varying Markov Decision Process)]
     Goal: find a policy maximizing E_π[Σ_{t=0}^T Y_t] among π ∈ Π.
  15. Policy learning for longitudinal data (IPW)
     The natural idea is to use an IPW estimator for OPE. The IPW way (Davidian et al., 2015):
         {π̂_t}_{t=0}^T = arg max_{{π_t}} Σ_{i=1}^n Σ_{t=0}^T [ Π_{k=0}^t π_k(A_k^(i) | X_k^(i)) / π_{b,k}(A_k^(i) | X_k^(i)) ] Y_t^(i).
     A sketch of this trajectory-wise IPW value follows below.
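A sketch of the trajectory-wise IPW value inside the arg max; the (n, T+1) array layout for the per-step policy probabilities and rewards is an assumption made for the illustration.

```python
import numpy as np

def sequential_ipw_value(pi_probs, pb_probs, rewards):
    """Trajectory-wise IPW value for a horizon-T policy.

    pi_probs, pb_probs, rewards have shape (n, T+1) and hold
    pi_t(A_t | X_t), pi_{b,t}(A_t | X_t) and Y_t for each trajectory.
    Each reward Y_t is weighted by the cumulative importance ratio
    prod_{k <= t} pi_k / pi_{b,k}."""
    ratios = np.cumprod(pi_probs / pb_probs, axis=1)   # rho_{0:t} per step
    return np.mean(np.sum(ratios * rewards, axis=1))
```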
  16. Policy learning for longitudinal data (FQI)
     This corresponds to the direct method (DM).
     FQI for the finite horizon is defined as an iterative procedure (Szepesvari and Munos, 2005; Chen and Jiang, 2019):
         Q̂_t = arg min_{q ∈ F_t} Σ_{i=1}^n { Y_t^(i) + max_a Q̂_{t+1}(X_{t+1}^(i), a) − q(X_t^(i), A_t^(i)) }^2,
     starting from Q̂_{T+1} = 0.
     The following quantity characterizes the inherent difficulty of FQI:
         sup_{g ∈ F} inf_{f ∈ F} ‖Bg − f‖_2,
     where B is the Bellman operator for Q-functions.
     A sketch of the backward recursion follows below.
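A sketch of the backward recursion, assuming scalar states laid out as (n_episodes, T+1) arrays and using a random forest as an arbitrary stand-in for the regression class F.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fqi(X, A, R, X_next, n_actions):
    """Finite-horizon fitted Q-iteration by backward recursion.

    X, A, R, X_next have shape (n_episodes, T+1) with scalar states for
    simplicity; R[:, t] is Y_t and X_next[:, t] is X_{t+1}.  Q_hat_{T+1}
    is taken to be 0."""
    n, horizon = X.shape                             # horizon = T + 1 steps
    models = [None] * horizon
    for t in range(horizon - 1, -1, -1):             # t = T, ..., 0
        if t == horizon - 1:
            q_next = np.zeros(n)                     # Q_hat_{T+1} = 0
        else:
            # max_a Q_hat_{t+1}(X_{t+1}, a)
            q_next = np.max(np.column_stack([
                models[t + 1].predict(
                    np.column_stack([X_next[:, t], np.full(n, a)]))
                for a in range(n_actions)
            ]), axis=1)
        target = R[:, t] + q_next                    # regression target
        m = RandomForestRegressor(n_estimators=100, random_state=0)
        m.fit(np.column_stack([X[:, t], A[:, t]]), target)
        models[t] = m
    return models       # greedy policy at time t: argmax_a models[t].predict([[x, a]])
```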
  17. Policy learning for longitudinal data
     Following the bandit case, the IPW and DM methods can be extended to a doubly robust score version (Zhang et al., 2013; Nie et al., 2019).
     However, the regret will grow exponentially with respect to the horizon. Can this problem be relaxed under the MDP assumption?
  18. Policy learning under unmeasured confounding
     Assume only that the propensity π_b belongs to some weight class W (a missing-not-at-random setting). The idea of Kallus and Zhou (2018a) is to find a robust policy:
         arg max_{π ∈ Π} min_{π_b ∈ W} E[ Y π(A | X) / π_b(A | X) ].
     The additional issues are the optimization and the regret guarantee.
     A crude sketch of the inner worst-case evaluation follows below.
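A crude sketch of the inner worst-case evaluation, using a simple box uncertainty set around the nominal inverse propensity weights instead of the weight class W and the normalization constraints actually used by Kallus and Zhou (2018a); the sensitivity parameter gamma is made up. The robust policy would maximize this worst-case value over Π.

```python
import numpy as np

def worst_case_ipw_value(pi_probs, Y, pb, gamma=1.5):
    """Illustrative inner minimization of a confounding-robust objective:
    evaluate the IPW value under the worst-case inverse propensity weights
    in a box [1/(gamma * pb), gamma / pb] around the nominal ones.  This
    only shows the max-min structure, not the sharper program solved by
    Kallus and Zhou (2018a)."""
    lo, hi = 1.0 / (gamma * pb), gamma / pb
    terms = pi_probs * Y                     # pi(A_i | X_i) * Y_i
    w = np.where(terms > 0, lo, hi)          # adversary shrinks gains, inflates losses
    return np.mean(w * terms)
```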
  19. Other estimators and situations
     Multiple loggers (stratified sampling) (He et al., 2019)
     Value functions taking a special (though general enough) form when A is continuous (Chernozhukov et al., 2019)
     A-learning (close to partially linear models and structural nested mean models) (Robins, 2004; Schulte et al., 2015)
     Off-policy gradients (assuming the policy class is smooth) (Chen et al., 2018)
     Incorporating fairness (Corbett-Davies et al., 2017; Kitagawa and Tetenov, 2018a; Metevier et al., 2019)
     Off-policy learning to rank (Joachims et al., 2017)
     Finite-population inference (Imai and Li, 2019)
  20. Open questions
     Inference (Laber et al., 2014; Andrews et al., 2018; Jiang et al., 2019)
     Efficient off-policy learning for RL (the nuisance functions are related to the policy itself) (Nie et al., 2019)
     Efficient off-policy learning on general DAGs with unmeasured variables
     Off-policy learning on networks (Viviano, 2019)
     Off-policy learning with partial identification
  21. Ref I
     Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267.
     Andrews, I., T. Kitagawa, and A. McCloskey (2018). Inference on winners. IDEAS Working Paper Series from RePEc.
     Athey, S. and S. Wager (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.
     Chakraborty, B. and E. Moodie (2013). Statistical Methods for Dynamic Treatment Regimes. Springer.
     Chen, J. and N. Jiang (2019). Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Volume 97, pp. 1042–1051.
     Chen, M., A. Beutel, P. Covington, S. Jain, F. Belletti, and E. Chi (2018). Top-k off-policy correction for a REINFORCE recommender system. WSDM.
     Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21, C1–C68.
  22. Ref II
     Chernozhukov, V., M. Demirer, G. Lewis, and V. Syrgkanis (2019). Semi-parametric efficient policy learning with continuous actions. In Advances in Neural Information Processing Systems 32.
     Corbett-Davies, S., E. Pierson, A. Feller, S. Goel, and A. Huq (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pp. 797–806. ACM.
     Davidian, M., A. Tsiatis, and E. Laber (2015). Value search estimators for optimal dynamic treatment regimes. In Adaptive Treatment Strategies in Practice, pp. 135–155. Society for Industrial and Applied Mathematics.
     Dudik, M., D. Erhan, J. Langford, and L. Li (2014). Doubly robust policy evaluation and optimization. Statistical Science 29, 485–511.
     Foster, D. and V. Syrgkanis (2019). Orthogonal statistical learning. arXiv preprint.
     He, L., X. Long, W. Zeng, Y. Zhao, and D. Yin (2019). Off-policy learning for multiple loggers. arXiv preprint.
     Imai, K. and M. Li (2019). Experimental evaluation of individualized treatment rules. arXiv preprint.
  23. Ref III
     Jiang, B., R. Song, J. Li, and D. Zeng (2019). Entropy learning for dynamic treatment regimes. Statistica Sinica 29, 1633–1655.
     Joachims, T., A. Swaminathan, and T. Schnabel (2017). Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM '17, pp. 781–789. ACM.
     Kallus, N. and A. Zhou (2018a). Confounding-robust policy improvement. In Advances in Neural Information Processing Systems 31, pp. 9269–9279.
     Kallus, N. and A. Zhou (2018b). Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pp. 1243–1251.
     Kitagawa, T. and A. Tetenov (2018a). Equality-minded treatment choice. IDEAS Working Paper Series from RePEc.
     Kitagawa, T. and A. Tetenov (2018b). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86, 591–616.
     Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34, 2593–2656.
     Kosorok, M. R. and E. B. Laber (2019). Precision medicine. Annual Review of Statistics and Its Application 6, 263–286.
  24. Ref IV
     Kosorok, M. R. and E. E. M. Moodie (2015). Adaptive Treatment Strategies in Practice. Society for Industrial and Applied Mathematics.
     Laber, E. B., D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy (2014). Dynamic treatment regimes: technical challenges and applications. Electronic Journal of Statistics 8, 1225–1272.
     Luedtke, A. and A. Chambaz (2017). Faster rates for policy learning. arXiv preprint.
     Metevier, B., S. Giguere, S. Brockman, A. Kobren, Y. Brun, E. Brunskill, and P. S. Thomas (2019). Offline contextual bandits with high probability fairness guarantees. In Advances in Neural Information Processing Systems 32, pp. 14893–14904.
     Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 331–355.
     Nie, X., E. Brunskill, and S. Wager (2019). Learning when-to-treat policies. arXiv preprint.
     Okui, R., D. S. Small, Z. Tan, and J. M. Robins (2012). Doubly robust instrumental variable regression. Statistica Sinica 22, 173–205.
     Qian, M. and S. Murphy (2011). Performance guarantees for individualized treatment rules. Annals of Statistics 39, 1180.
  25. Ref V
     Robins, J. (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics: Analysis of Correlated Data.
     Schulte, P., A. Tsiatis, E. Laber, and M. Davidian (2015). Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science 29.
     Swaminathan, A. and T. Joachims (2015a). Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research 16, 1731–1755.
     Swaminathan, A. and T. Joachims (2015b). The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems 28, pp. 3231–3239.
     Szepesvari, C. and R. Munos (2005). Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pp. 880–887. ACM.
     Szepesvari, C. (2010). Algorithms for Reinforcement Learning, Volume 9 of Synthesis Digital Library of Engineering and Computer Science. San Rafael, CA: Morgan & Claypool.
  26. Ref VI
     Zhang, B., A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber (2016). Estimating optimal treatment regimes from a classification perspective. Stat 5, 278–278.
     Zhang, B., A. A. Tsiatis, E. B. Laber, and M. Davidian (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100, 681–694.
     Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.
     Zhao, Y.-Q., E. B. Laber, Y. Ning, S. Saha, and B. E. Sands (2019). Efficient augmentation and relaxation learning for individualized treatment rules using observational data. Journal of Machine Learning Research 20.
     Zhou, Z., S. Athey, and S. Wager (2018). Offline multi-action policy learning: Generalization and optimization. arXiv preprint.
