Logged user interactions are one of the most ubiquitous forms of data available because they can be recorded from a variety of systems (e.g., search engines, recommender systems, ad placement) at little cost. Naively using this data, however, is prone to failure. A key problem lies in biases that systems inject into the logs by influencing where we will receive feedback (e.g., more clicks at the top of the search ranking). This talk explores how counterfactual inference techniques can make learning algorithms robust against bias. This makes log data accessible to a broad range of learning algorithms, from ranking SVMs to deep networks.
2. Interaction Logs
• Interaction logs
– Terabytes of log data
• Examples
– Ad placement
– Search engines
– Entertainment media
– E-commerce
– Smart homes
• Tasks
– Measure and optimize performance
– Gather and maintain knowledge
– Personalization
8. Outline
• Batch Learning from Bandit Feedback (BLBF)
– $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
– Hypothesis space, risk, empirical risk, and overfitting
– Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
– POEM: Bandit training of CRF policies for structured outputs
– BanditNet: Bandit training of deep network policies
• Learning from Logged Interventions: BLBF and beyond
9. Causal Inference for Action Policy
• Controlled randomized trial (medical treatment):
– Context $x$: patient (e.g., lab tests)
– Action $y$: treatment given
– Effect $\delta(x, y)$: survival time
• Online system:
– Context $x$: user (e.g., query, profile)
– Action $y$: ranking, ad, news
– Effect $\delta(x, y)$: clicks, reading time
10. Policy
• Given context $x$, policy $\pi$ selects action $y$ with probability $\pi(y|x)$
• Note: stochastic policies $\supset$ deterministic policies
[Figure: two example action distributions $\pi_1(Y|x)$ and $\pi_2(Y|x)$ over $Y$ for a given context $x$]
11. Utility
The expected reward / utility U(𝜋) of policy
𝜋 is U 𝜋 =
𝑥∈𝑋
𝑦∈𝑌
𝛿 𝑥, 𝑦 𝜋 𝑦 𝑥 𝑃 𝑥
𝜋1(𝑌|𝑥) 𝜋2(𝑌|𝑥)
𝑌|𝑥
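To make the definition concrete, here is a minimal Python sketch that evaluates $U(\pi)$ for a toy problem with two contexts and three actions (all numbers are made up for illustration):

```python
import numpy as np

# Toy problem: 2 contexts, 3 actions (hypothetical numbers).
P_x = np.array([0.6, 0.4])            # P(x): distribution over contexts
delta = np.array([[1.0, 0.2, 0.0],    # delta(x, y): reward of action y in context x
                  [0.0, 0.5, 1.0]])

def utility(pi):
    """U(pi) = sum_x P(x) * sum_y pi(y|x) * delta(x, y)."""
    return float(np.sum(P_x[:, None] * pi * delta))

pi_1 = np.array([[0.8, 0.1, 0.1],     # pi(y|x): one row of action probabilities per context
                 [0.1, 0.1, 0.8]])
pi_2 = np.full((2, 3), 1.0 / 3.0)     # uniform policy for comparison

print(utility(pi_1), utility(pi_2))   # 0.832 vs. 0.44: pi_1 concentrates on high-reward actions
```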
17. Randomness is Your Friend
• Sources of randomness
– $\pi_0(Y|x)$ is random
– Features in $x$ are random
– Choice of $\pi_i(Y|x)$ is random
– User behavior is random
– Etc.
[Figure: logging distribution $\pi_0(Y|x_i)$ over actions $Y$ for context $x_i$]
18. Minimizing Training Error
• Setup
– Log using stochastic $\pi_0$: $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Learn new policy $\pi \in H$
• Training [Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
$$\hat{\pi} := \operatorname*{argmax}_{\pi \in H} \; \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i|x_i)}{\pi_0(y_i|x_i)}\, \delta_i$$
[Figure: importance sampling reweights the logging distribution $\pi_0(Y|x)$ to evaluate candidate policies such as $\pi_1(Y|x)$ and $\pi_{237}(Y|x)$]
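The quantity being maximized is the inverse propensity scoring (IPS) estimate of $U(\pi)$, computable from the log alone. A minimal sketch (the argument names are assumptions of the example):

```python
import numpy as np

def ips_estimate(pi_new, X, Y, p0, delta):
    """IPS estimate of U(pi_new) from a log collected under pi_0.

    X, Y, delta: logged contexts, actions, and rewards.
    p0[i] = pi_0(y_i | x_i), the propensity recorded at logging time.
    pi_new(y, x): probability the new policy assigns to the logged action.
    """
    weights = np.array([pi_new(y, x) for x, y in zip(X, Y)]) / np.asarray(p0)
    return float(np.mean(weights * np.asarray(delta)))
```

Training then searches $H$ for the policy that maximizes this estimate. Note that only the logged propensities $\pi_0(y_i|x_i)$ are needed, not the full logging distribution.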
19. Learning Theory for BLBF
• Theorem [Generalization Error Bound]: For any policy space $H$ with capacity $C$, and for all $\pi \in H$, with probability $1 - \eta$:
$$U(\pi) \;\geq\; \underbrace{\hat{U}(\pi)}_{\text{unbiased estimator}} \;-\; \underbrace{O\!\left(\sqrt{\operatorname{Var}(\hat{U}(\pi)) / n}\right)}_{\text{variance overfitting}} \;-\; \underbrace{O(C)}_{\text{capacity overfitting}}$$
• The bound accounts for the fact that the variance of the risk estimator can vary greatly between different $\pi \in H$.
[Swaminathan & Joachims, 2015]
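This bound directly suggests the Counterfactual Risk Minimization objective: maximize the IPS estimate minus a variance penalty. A minimal sketch, where `lam` plays the role of the constant hidden in the $O(\cdot)$ and is treated as a hyperparameter:

```python
import numpy as np

def crm_objective(weights, delta, lam=1.0):
    """Variance-regularized objective: mean(u) - lam * sqrt(Var(u) / n),
    where u_i = delta_i * pi(y_i|x_i) / pi_0(y_i|x_i)."""
    u = np.asarray(weights) * np.asarray(delta)
    n = len(u)
    return float(u.mean() - lam * np.sqrt(u.var(ddof=1) / n))
```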
21. Outline
• Batch Learning from Bandit Feedback (BLBF)
– $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
– Hypothesis space, risk, empirical risk, and overfitting
– Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
– POEM: Bandit training of CRF policies for structured outputs
– BanditNet: Bandit training of deep network policies
• Learning from Logged Interventions: BLBF and beyond
22. POEM: Policy Space
• Policy space
$$\pi_w(y|x) = \frac{1}{Z(x)} \exp\big(w \cdot \Phi(x, y)\big)$$
with
– $w$: parameter vector to be learned
– $\Phi(x, y)$: joint feature map between input and output
– $Z(x)$: partition function (i.e., normalizer)
• Note: same form as CRF or Structural SVM
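A sketch of evaluating this log-linear policy over a finite candidate set; `joint_features`, standing in for $\Phi(x, y)$, is assumed to be supplied by the application:

```python
import numpy as np

def policy_probs(w, x, candidates, joint_features):
    """pi_w(y|x) = exp(w . Phi(x, y)) / Z(x), computed over the candidates Y(x)."""
    scores = np.array([w @ joint_features(x, y) for y in candidates])
    scores -= scores.max()               # subtract max for numerical stability
    expscores = np.exp(scores)
    return expscores / expscores.sum()   # normalization implements 1/Z(x)
```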
24. POEM: Text Classification
• Data: Reuters text classification
– $S^* = \{(x_1, y_1^*), \ldots, (x_m, y_m^*)\}$
– Label vectors $y^* = (y^1, y^2, y^3, y^4)$
• Bandit feedback generation:
– Draw document $x_i$
– Pick $y_i$ via logging policy $\pi_0(Y|x_i)$
– Observe loss $\delta_i = \text{Hamming}(y_i, y_i^*)$
– Example: $y_i = (1, 0, 1, 0)$ drawn with propensity $p_i = 0.3$, incurring loss $\delta_i = 2$
– Log: $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
[Joachims et al., 2017]
• Results: [Figure: Hamming loss vs. amount of bandit training data (in epochs), comparing the $\pi_0$ logging policy, POEM, and the supervised CRF]
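The supervised-to-bandit conversion above is straightforward to reproduce. A sketch, assuming a hypothetical `log_policy(x)` that returns $\pi_0(Y|x)$ as a probability vector over all $2^4$ label vectors:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
label_vectors = [np.array(v) for v in product([0, 1], repeat=4)]  # all 2^4 label vectors

def make_bandit_log(X, Y_star, log_policy):
    """Convert supervised pairs (x, y*) into bandit tuples (x, y, delta, p)."""
    S = []
    for x, y_star in zip(X, Y_star):
        probs = log_policy(x)                      # pi_0(Y|x) over label_vectors
        i = rng.choice(len(label_vectors), p=probs)
        y = label_vectors[i]
        delta = int(np.sum(y != y_star))           # Hamming loss against the true labels
        S.append((x, y, delta, probs[i]))          # record the propensity p_i
    return S
```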
25. POEM: Real-World Application
• Task: Given query $x$
– Select answer $y$ from $Y(x)$ (or abstain)
– Maximize the number of correct minus incorrect answers
• Existing system:
– Small supervised set $D = \{(x_1, y_1^*), \ldots\}$
– Logistic regression scorer $s(y|x) = w \cdot \Phi(x, y)$
– Select $y = \operatorname{argmax}_{y \in Y(x)} s(y|x)$
• Bandit data:
– Field logging policy $\pi_0(y|x) = \operatorname{softmax}(\lambda\, s(y|x))$
– Tune $\lambda$ for desired stochasticity
– User feedback $\delta \in \{-1, 0, 1\}$
– Log: $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
• Results:
– Learn new $\pi(y|x)$ via POEM on $S$
– Size of $S$: about 5000
– Improvement: 30% over the existing system
[Swaminathan & Joachims, 2015]
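A sketch of fielding the softmax logging policy from the deterministic scorer (hypothetical code, not the production system): given the scores $s(y|x)$ for all candidates, sample an answer and record its propensity:

```python
import numpy as np

rng = np.random.default_rng(0)

def field_answer(scores, lam):
    """Sample from pi_0(y|x) = softmax(lam * s(y|x)).

    Large lam approaches the old argmax system; smaller lam explores more."""
    z = lam * np.asarray(scores)
    z -= z.max()                        # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    i = rng.choice(len(probs), p=probs)
    return i, probs[i]                  # chosen candidate and its logged propensity p
```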
26. Outline
• Batch Learning from Bandit Feedback (BLBF)
– $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
– Hypothesis space, risk, empirical risk, and overfitting
– Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
– POEM: Bandit training of CRF policies for structured outputs
– BanditNet: Bandit training of deep network policies
• Learning from Logged Interventions: BLBF and beyond
27. BanditNet: Policy Space
• Policy space
$$\pi_w(y|x) = \frac{1}{Z(x)} \exp\big(\mathrm{DeepNet}(x, y\,|\,w)\big)$$
with
– $w$: parameter tensors to be learned
– $Z(x)$: partition function
• Note: same form as a deep net with softmax output
[Joachims et al., 2017]
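A minimal PyTorch sketch of one IPS training step for such a policy (the network, batch, and variable names are assumptions; BanditNet itself additionally translates the objective with a baseline to counter propensity overfitting, which is omitted here):

```python
import torch

def ips_loss(net, x, y, p0, delta):
    """Negative IPS estimate for a softmax policy over a deep net's scores.

    net(x): (batch, num_actions) scores; y: logged actions; p0: logged
    propensities pi_0(y_i|x_i); delta: logged rewards."""
    log_pi = torch.log_softmax(net(x), dim=1)                  # log pi_w(.|x)
    pi_y = log_pi.gather(1, y.unsqueeze(1)).squeeze(1).exp()   # pi_w(y_i|x_i)
    return -(delta * pi_y / p0).mean()                         # ascend the IPS estimate
```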
30. Outline
• Batch Learning from Bandit Feedback (BLBF)
– $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
– Hypothesis space, risk, empirical risk, and overfitting
– Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
– POEM: Bandit training of CRF policies for structured outputs
– BanditNet: Bandit training of deep network policies
• Learning from Biased Feedback: BLBF and beyond
31. Unbiased Learning-to-Rank from Clicks
• Idea: Exploit the randomness of users
– SVM PropRank
– Deep PropDCG
– Propensity estimation
[Figure: queries $x_i \sim P(X)$ are answered by the deployed ranker $\bar{y}_i = \pi_0(x_i)$; the presented rankings $\bar{y}_1, \ldots, \bar{y}_n$ and their clicks feed a learning algorithm that outputs a new ranker $\pi(x)$, which should perform better than $\pi_0(x)$]
[Joachims et al., 2017] [Agarwal et al., 2018]
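As a sketch of the propensity-weighted idea behind SVM PropRank and Deep PropDCG: each logged click is weighted by the inverse of the probability that its position was examined, which removes position bias in expectation (the helper names here are hypothetical):

```python
def prop_weighted_rank_risk(clicked_items, rank_under_pi, exam_propensity):
    """Inverse-propensity-weighted empirical risk for learning-to-rank:
    sum over clicks of rank(item | new ranker) / P(position examined).
    Lower is better; the 1/p weights undo position bias in the click log."""
    return sum(rank_under_pi(item) / exam_propensity(item)
               for item in clicked_items)
```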