This document discusses counterfactual learning methods that improve recommendation systems by learning directly from logged user feedback rather than relying on traditional collaborative filtering. It introduces several methods, including the dual bandit approach, that optimize recommendations by combining contextual bandit strategies with value-based learning to handle stochastic and sparse rewards. Experiments show that the proposed methods improve recommendation performance across different logging policies.
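
To make "learning from logged user feedback" concrete, the sketch below estimates how a new recommendation policy would have performed using only data collected under a different logging policy. It uses inverse propensity scoring (IPS), a standard off-policy estimator in the contextual bandit literature; the summary above does not state which estimator the proposed methods rely on, so this is an illustrative assumption rather than the document's method. The function name `ips_value`, its parameters, and the toy numbers are all hypothetical.

```python
from typing import Optional, Sequence


def ips_value(
    rewards: Sequence[float],
    logging_probs: Sequence[float],
    target_probs: Sequence[float],
    clip: Optional[float] = None,
) -> float:
    """Inverse propensity scoring estimate of a target policy's mean reward.

    Each index is one logged interaction:
      rewards[i]       - observed feedback (e.g. 1.0 for a click, 0.0 otherwise)
      logging_probs[i] - probability the logging policy gave to the shown item
      target_probs[i]  - probability the target policy gives to that same item
    `clip` optionally caps the importance weights, trading bias for variance.
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one logged interaction")
    if not (n == len(logging_probs) == len(target_probs)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for reward, p_log, p_target in zip(rewards, logging_probs, target_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be strictly positive")
        weight = p_target / p_log  # importance weight for this interaction
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
    return total / n


# Toy log (hypothetical values): four impressions, two of them clicked.
logged_rewards = [1.0, 0.0, 1.0, 0.0]
logged_probs = [0.5, 0.5, 0.25, 0.25]
new_policy_probs = [1.0, 0.0, 0.5, 0.5]

estimate = ips_value(logged_rewards, logged_probs, new_policy_probs)
clipped_estimate = ips_value(logged_rewards, logged_probs, new_policy_probs, clip=1.5)
```

The estimate is unbiased only when the logging policy assigns non-zero probability to every item the target policy might show. With sparse rewards the importance weights can make it high-variance, which is one reason weight clipping is commonly applied.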