Logged user interactions are one of the most ubiquitous forms of data available because they can be recorded from a variety of systems (e.g., search engines, recommender systems, ad placement) at little cost. Naively using this data, however, is prone to failure. A key problem lies in biases that systems inject into the logs by influencing where we will receive feedback (e.g., more clicks at the top of the search ranking). This talk explores how counterfactual inference techniques can make learning algorithms robust against bias. This makes log data accessible to a broad range of learning algorithms, from ranking SVMs to deep networks.
2. Interaction Logs
• Interaction logs
– Terabytes of log data
• Examples
– Ad placement
– Search engines
– Entertainment media
– E-commerce
– Smart homes
• Tasks
– Measure and optimize performance
– Gather and maintain knowledge
– Personalization
8. Outline
• Batch Learning from Bandit Feedback (BLBF)
– $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
– Hypothesis space, risk, empirical risk, and overfitting
– Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
– POEM: Bandit training of CRF policies for structured outputs
– BanditNet: Bandit training of deep network policies
• Learning from Logged Interventions: BLBF and beyond
9. Causal Inference for Action Policy
• Controlled randomized trial (medical treatment):
– Context $x$: patient (e.g., lab tests)
– Action $y$: treatment given
– Effect $\delta(x, y)$: survival time
• Online system:
– Context $x$: user (e.g., query, profile)
– Action $y$: ranking, ad, news
– Effect $\delta(x, y)$: clicks, reading time
10. Policy
• Given context $x$, policy $\pi$ selects action $y$ with probability $\pi(y|x)$
• Note: stochastic policies $\supset$ deterministic policies
[Figure: two example action distributions $\pi_1(Y|x)$ and $\pi_2(Y|x)$ over $Y$ for a given context $x$]
11. Utility
The expected reward / utility U(𝜋) of policy
𝜋 is U 𝜋 =
𝑥∈𝑋
𝑦∈𝑌
𝛿 𝑥, 𝑦 𝜋 𝑦 𝑥 𝑃 𝑥
𝜋1(𝑌|𝑥) 𝜋2(𝑌|𝑥)
𝑌|𝑥
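To make the definition concrete, here is a minimal Python sketch that evaluates $U(\pi)$ for a toy problem with two contexts and three actions (all numbers are made up for illustration):

```python
import numpy as np

# Toy problem: 2 contexts, 3 actions (hypothetical numbers).
P_x = np.array([0.6, 0.4])            # P(x): distribution over contexts
delta = np.array([[1.0, 0.2, 0.0],    # delta(x, y): reward of action y in context x
                  [0.0, 0.5, 1.0]])

def utility(pi):
    """U(pi) = sum_x P(x) * sum_y pi(y|x) * delta(x, y)."""
    return float(np.sum(P_x[:, None] * pi * delta))

pi_1 = np.array([[0.8, 0.1, 0.1],     # pi(y|x): one row of action probabilities per context
                 [0.1, 0.1, 0.8]])
pi_2 = np.full((2, 3), 1.0 / 3.0)     # uniform policy for comparison

print(utility(pi_1), utility(pi_2))   # 0.832 vs. 0.44: pi_1 concentrates on high-reward actions
```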
17. Randomness is Your Friend
• Sources of randomness
– $\pi_0(Y|x)$ is random
– Features in $x$ are random
– Choice of $\pi_i(Y|x)$ is random
– User behavior is random
– Etc.
[Figure: logging distribution $\pi_0(Y|x_i)$ over actions $Y$ for context $x_i$]
18. Minimizing Training Error
• Setup
– Log using stochastic $\pi_0$: $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Learn new policy $\pi \in H$
• Training [Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
$$\hat{\pi} := \operatorname*{argmax}_{\pi \in H} \; \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i|x_i)}{\pi_0(y_i|x_i)}\, \delta_i$$
[Figure: importance sampling reweights the logging distribution $\pi_0(Y|x)$ to evaluate candidate policies such as $\pi_1(Y|x)$ and $\pi_{237}(Y|x)$]
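The quantity being maximized is the inverse propensity scoring (IPS) estimate of $U(\pi)$, computable from the log alone. A minimal sketch (the argument names are assumptions of the example):

```python
import numpy as np

def ips_estimate(pi_new, X, Y, p0, delta):
    """IPS estimate of U(pi_new) from a log collected under pi_0.

    X, Y, delta: logged contexts, actions, and rewards.
    p0[i] = pi_0(y_i | x_i), the propensity recorded at logging time.
    pi_new(y, x): probability the new policy assigns to the logged action.
    """
    weights = np.array([pi_new(y, x) for x, y in zip(X, Y)]) / np.asarray(p0)
    return float(np.mean(weights * np.asarray(delta)))
```

Training then searches $H$ for the policy that maximizes this estimate. Note that only the logged propensities $\pi_0(y_i|x_i)$ are needed, not the full logging distribution.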
19. Learning Theory for BLBF
• Theorem [Generalization Error Bound]: For any policy space $H$ with capacity $C$, and for all $\pi \in H$, with probability $1 - \eta$:
$$U(\pi) \;\geq\; \underbrace{\hat{U}(\pi)}_{\text{unbiased estimator}} \;-\; \underbrace{O\!\left(\sqrt{\operatorname{Var}(\hat{U}(\pi)) / n}\right)}_{\text{variance overfitting}} \;-\; \underbrace{O(C)}_{\text{capacity overfitting}}$$
• The bound accounts for the fact that the variance of the risk estimator can vary greatly between different $\pi \in H$.
[Swaminathan & Joachims, 2015]
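This bound directly suggests the Counterfactual Risk Minimization objective: maximize the IPS estimate minus a variance penalty. A minimal sketch, where `lam` plays the role of the constant hidden in the $O(\cdot)$ and is treated as a hyperparameter:

```python
import numpy as np

def crm_objective(weights, delta, lam=1.0):
    """Variance-regularized objective: mean(u) - lam * sqrt(Var(u) / n),
    where u_i = delta_i * pi(y_i|x_i) / pi_0(y_i|x_i)."""
    u = np.asarray(weights) * np.asarray(delta)
    n = len(u)
    return float(u.mean() - lam * np.sqrt(u.var(ddof=1) / n))
```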
21. Outline
• Batch Learning from Bandit Feedback (BLBF)
– $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
– Hypothesis space, risk, empirical risk, and overfitting
– Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
– POEM: Bandit training of CRF policies for structured outputs
– BanditNet: Bandit training of deep network policies
• Learning from Logged Interventions: BLBF and beyond
22. POEM: Policy Space
• Policy space
$$\pi_w(y|x) = \frac{1}{Z(x)} \exp\big(w \cdot \Phi(x, y)\big)$$
with
– $w$: parameter vector to be learned
– $\Phi(x, y)$: joint feature map between input and output
– $Z(x)$: partition function (i.e., normalizer)
• Note: same form as CRF or Structural SVM
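A sketch of evaluating this log-linear policy over a finite candidate set; `joint_features`, standing in for $\Phi(x, y)$, is assumed to be supplied by the application:

```python
import numpy as np

def policy_probs(w, x, candidates, joint_features):
    """pi_w(y|x) = exp(w . Phi(x, y)) / Z(x), computed over the candidates Y(x)."""
    scores = np.array([w @ joint_features(x, y) for y in candidates])
    scores -= scores.max()               # subtract max for numerical stability
    expscores = np.exp(scores)
    return expscores / expscores.sum()   # normalization implements 1/Z(x)
```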
24. POEM: Text Classification
• Data: Reuters text classification
– $S^* = \{(x_1, y_1^*), \ldots, (x_m, y_m^*)\}$
– Label vectors $y^* = (y^1, y^2, y^3, y^4)$
• Bandit feedback generation:
– Draw document $x_i$
– Pick $y_i$ via logging policy $\pi_0(Y|x_i)$
– Observe loss $\delta_i = \text{Hamming}(y_i, y_i^*)$
– Example: $y_i = (1, 0, 1, 0)$ drawn with propensity $p_i = 0.3$, incurring loss $\delta_i = 2$
– Log: $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
[Joachims et al., 2017]
• Results: [Figure: Hamming loss vs. amount of bandit training data (in epochs), comparing the $\pi_0$ logging policy, POEM, and the supervised CRF]
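The supervised-to-bandit conversion above is straightforward to reproduce. A sketch, assuming a hypothetical `log_policy(x)` that returns $\pi_0(Y|x)$ as a probability vector over all $2^4$ label vectors:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
label_vectors = [np.array(v) for v in product([0, 1], repeat=4)]  # all 2^4 label vectors

def make_bandit_log(X, Y_star, log_policy):
    """Convert supervised pairs (x, y*) into bandit tuples (x, y, delta, p)."""
    S = []
    for x, y_star in zip(X, Y_star):
        probs = log_policy(x)                      # pi_0(Y|x) over label_vectors
        i = rng.choice(len(label_vectors), p=probs)
        y = label_vectors[i]
        delta = int(np.sum(y != y_star))           # Hamming loss against the true labels
        S.append((x, y, delta, probs[i]))          # record the propensity p_i
    return S
```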
25. POEM: Real-World Application
• Task: Given query $x$
– Select answer $y$ from $Y(x)$ (or abstain)
– Maximize the number of correct minus incorrect answers
• Existing system:
– Small supervised set $D = \{(x_1, y_1^*), \ldots\}$
– Logistic regression scorer $s(y|x) = w \cdot \Phi(x, y)$
– Select $y = \operatorname{argmax}_{y \in Y(x)} s(y|x)$
• Bandit data:
– Field logging policy $\pi_0(y|x) = \operatorname{softmax}(\lambda\, s(y|x))$
– Tune $\lambda$ for desired stochasticity
– User feedback $\delta \in \{-1, 0, 1\}$
– Log: $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
• Results:
– Learn new $\pi(y|x)$ via POEM on $S$
– Size of $S$: about 5000
– Improvement: 30% over the existing system
[Swaminathan & Joachims, 2015]
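A sketch of fielding the softmax logging policy from the deterministic scorer (hypothetical code, not the production system): given the scores $s(y|x)$ for all candidates, sample an answer and record its propensity:

```python
import numpy as np

rng = np.random.default_rng(0)

def field_answer(scores, lam):
    """Sample from pi_0(y|x) = softmax(lam * s(y|x)).

    Large lam approaches the old argmax system; smaller lam explores more."""
    z = lam * np.asarray(scores)
    z -= z.max()                        # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    i = rng.choice(len(probs), p=probs)
    return i, probs[i]                  # chosen candidate and its logged propensity p
```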
26. Outline
• Batch Learning from Bandit Feedback (BLBF)
– $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
– Hypothesis space, risk, empirical risk, and overfitting
– Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
– POEM: Bandit training of CRF policies for structured outputs
– BanditNet: Bandit training of deep network policies
• Learning from Logged Interventions: BLBF and beyond
27. BanditNet: Policy Space
• Policy space
$$\pi_w(y|x) = \frac{1}{Z(x)} \exp\big(\mathrm{DeepNet}(x, y\,|\,w)\big)$$
with
– $w$: parameter tensors to be learned
– $Z(x)$: partition function
• Note: same form as a deep net with softmax output
[Joachims et al., 2017]
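A minimal PyTorch sketch of one IPS training step for such a policy (the network, batch, and variable names are assumptions; BanditNet itself additionally translates the objective with a baseline to counter propensity overfitting, which is omitted here):

```python
import torch

def ips_loss(net, x, y, p0, delta):
    """Negative IPS estimate for a softmax policy over a deep net's scores.

    net(x): (batch, num_actions) scores; y: logged actions; p0: logged
    propensities pi_0(y_i|x_i); delta: logged rewards."""
    log_pi = torch.log_softmax(net(x), dim=1)                  # log pi_w(.|x)
    pi_y = log_pi.gather(1, y.unsqueeze(1)).squeeze(1).exp()   # pi_w(y_i|x_i)
    return -(delta * pi_y / p0).mean()                         # ascend the IPS estimate
```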
30. Outline
• Batch Learning from Bandit Feedback (BLBF)
– $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
– Hypothesis space, risk, empirical risk, and overfitting
– Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
– POEM: Bandit training of CRF policies for structured outputs
– BanditNet: Bandit training of deep network policies
• Learning from Biased Feedback: BLBF and beyond
31. Unbiased Learning-to-Rank from Clicks
• Idea: Exploit the randomness of users
– SVM PropRank
– Deep PropDCG
– Propensity estimation
[Figure: queries $x_i \sim P(X)$ are answered by the deployed ranker $\bar{y}_i = \pi_0(x_i)$; the presented rankings $\bar{y}_1, \ldots, \bar{y}_n$ and their clicks feed a learning algorithm that outputs a new ranker $\pi(x)$, which should perform better than $\pi_0(x)$]
[Joachims et al., 2017] [Agarwal et al., 2018]
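As a sketch of the propensity-weighted idea behind SVM PropRank and Deep PropDCG: each logged click is weighted by the inverse of the probability that its position was examined, which removes position bias in expectation (the helper names here are hypothetical):

```python
def prop_weighted_rank_risk(clicked_items, rank_under_pi, exam_propensity):
    """Inverse-propensity-weighted empirical risk for learning-to-rank:
    sum over clicks of rank(item | new ranker) / P(position examined).
    Lower is better; the 1/p weights undo position bias in the click log."""
    return sum(rank_under_pi(item) / exam_propensity(item)
               for item in clicked_items)
```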