© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AI Summit
Unbiased Learning from Biased User Feedback
Thorsten Joachims, Cornell University, Amazon Scholar
Interaction Logs
Interaction logs: terabytes of log data
Examples: ad placement, search engines, entertainment media, e-commerce, smart homes
Tasks: measure and optimize performance, gather and maintain knowledge, personalization
Ad Placement
Context 𝑥: user and page
Action 𝑦: ad that is placed
Effect 𝛿(𝑥, 𝑦): click / no-click
News Recommender
Context 𝑥: user
Action 𝑦: portfolio of news articles
Effect 𝛿(𝑥, 𝑦): reading time in minutes
Search Engine
Context 𝑥: query
Action 𝑦: ranking
Effect 𝛿(𝑥, 𝑦): clicks on SERP
Batch Learning from Bandit Feedback
Data: “Bandit” feedback biased by system 𝜋0
$S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
Goal of Learning: find a new system 𝜋 that selects 𝑦 with better 𝛿
[Figure: system 𝜋0 receives a context, chooses an action, and observes a reward]
[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
Learning Settings

                   Full-Information            Partial-Information
                   (Labeled) Feedback          (e.g. Bandit) Feedback
Online Learning    Perceptron, Winnow, etc.    EXP3, UCB1, etc.
Batch Learning     SVM, Random Forests, etc.   ?
Outline
Batch Learning from Bandit Feedback (BLBF)
$S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
→ Find new policy 𝜋 that selects 𝑦 with better 𝛿
Learning Principle for BLBF
Hypothesis Space, Risk, Empirical Risk, and Overfitting
Learning Principle: Counterfactual Risk Minimization
Learning Algorithms for BLBF
POEM: Bandit training of CRF policies for structured outputs
BanditNet: Bandit training of deep network policies
Learning from Logged Interventions: BLBF and beyond
Causal Inference for Action Policy

Medical Treatment
Context 𝑥: patient (e.g. lab tests)
Action 𝑦: treatment given
Effect 𝛿(𝑥, 𝑦): survival time

Online System
Context 𝑥: user (e.g. query, profile)
Action 𝑦: ranking, ad, news
Effect 𝛿(𝑥, 𝑦): clicks, reading time

→ Controlled Randomized Trial
Policy
Given context 𝑥, policy 𝜋 selects action 𝑦 with probability 𝜋(𝑦|𝑥)
Note: stochastic policies ⊃ deterministic policies
[Figure: two example action distributions 𝜋1(𝑌|𝑥) and 𝜋2(𝑌|𝑥)]
Utility
The expected reward / utility U(𝜋) of policy 𝜋 is
$U(\pi) = \sum_{x \in X} \sum_{y \in Y} \delta(x, y)\, \pi(y|x)\, P(x)$
Estimating Utility
A/B Testing: field 𝜋 and collect $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
→ unbiased estimate of utility:
$\hat{U}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \delta_i$
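To make the on-policy (A/B test) estimate concrete, here is a minimal numpy sketch: deploy the policy, log one reward per context, and average the rewards. The context distribution, score function, and click model are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5

def policy(x):
    """Stochastic policy pi(Y|x): softmax over a toy score vector."""
    scores = np.sin(x + np.arange(n_actions))
    p = np.exp(scores)
    return p / p.sum()

def field_policy(pi, n=10000):
    """Deploy pi online: draw a context, sample an action, observe a reward."""
    deltas = []
    for _ in range(n):
        x = rng.normal()                              # context x ~ P(X)
        y = rng.choice(n_actions, p=pi(x))            # action y ~ pi(Y|x)
        delta = float(rng.random() < 0.1 * (y + 1))   # toy click model
        deltas.append(delta)
    return np.array(deltas)

# On-policy A/B estimate: U_hat(pi) is the mean of the observed rewards.
print("U_hat(pi) =", field_policy(policy).mean())
```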
Estimating Utility Offline
• Online A/B Test: draw 𝑆1 from 𝜋1 → Û(𝜋1), draw 𝑆2 from 𝜋2 → Û(𝜋2), …, draw 𝑆7 from 𝜋7 → Û(𝜋7) (one new log per candidate policy)
• Offline A/B Test → Counterfactual Estimates: draw a single log 𝑆 from 𝜋0 and compute Û(𝜋1), Û(𝜋2), …, Û(𝜋30), … for many candidate policies from that one log
The Problem with Offline A/B-Tests
→ Missing information about 𝛿(𝑥, 𝑦) for actions chosen by new 𝜋
[Figure: 𝜋0(𝑌|𝑥𝑖) vs. 𝜋(𝑌|𝑥𝑖) over the actions 𝑌|𝑥𝑖]
Approach 1: Reward Predictor
Learn regression model $\hat{\delta}: x \times y \rightarrow \Re$
Use data $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$ with feature vectors $\Psi(x, y)$
Impute $\hat{\delta}(x, y)$ when missing
$\hat{U}(\pi) = \frac{1}{n} \sum_{x_i \in S} \sum_{y \in Y} \hat{\delta}(x_i, y)\, \pi(y | x_i)$
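A minimal sketch of this reward-predictor ("direct method") estimate, assuming the log is a list of (x, y, δ) tuples, `features(x, y)` is a toy stand-in for Ψ(x, y), and `pi(x)` returns the new policy's probability vector over actions; the ridge regressor is an illustrative choice, not the model from the talk.

```python
import numpy as np
from sklearn.linear_model import Ridge

n_actions = 5

def features(x, y):
    """Toy joint feature map Psi(x, y): one-hot action plus the scalar context."""
    phi = np.zeros(n_actions + 1)
    phi[y] = 1.0
    phi[-1] = x
    return phi

def reward_predictor_estimate(log, pi):
    """Approach 1: fit delta_hat(x, y) on the logged data, then impute
    rewards for every action the new policy pi could have taken."""
    X = np.array([features(x, y) for x, y, _ in log])
    d = np.array([delta for _, _, delta in log])
    model = Ridge(alpha=1.0).fit(X, d)

    total = 0.0
    for x, _, _ in log:
        p = pi(x)                                           # pi(Y|x)
        cand = np.array([features(x, a) for a in range(n_actions)])
        total += float(np.dot(p, model.predict(cand)))      # sum_y delta_hat * pi
    return total / len(log)
```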
Approach 2: Counterfactual Estimator
Given $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$ collected under 𝜋0:
$\hat{U}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i | x_i)}{\pi_0(y_i | x_i)} \delta_i$
→ Unbiased estimate of utility, provided the propensity $p_i = \pi_0(y_i | x_i)$ is nonzero everywhere (where it matters).
[Horvitz & Thompson, 1952] [Rubin, 1983] [Zadrozny et al., 2003] [Langford & Li, 2009]
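A minimal numpy sketch of this inverse-propensity-scoring (IPS) estimate, assuming each log entry stores the propensity p = 𝜋0(𝑦|𝑥) alongside (x, y, δ) and that `pi(x)` returns the new policy's probability vector over actions (both are assumptions of the sketch, not part of the slide):

```python
import numpy as np

def ips_estimate(log, pi):
    """IPS / counterfactual estimate of U(pi) from logs collected under pi_0.
    `log` is a list of (x, y, delta, p) tuples with p = pi_0(y|x)."""
    values = [pi(x)[y] / p * delta for x, y, delta, p in log]
    return float(np.mean(values))
```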
Randomness is Your Friend
Sources of randomness
– 𝜋0(𝑌|𝑥) is random
– Features in 𝑥 are random
– Choice of 𝜋𝑖(𝑌|𝑥) is random
– User behavior is random
– Etc.
Minimizing Training Error
• Setup
– Log using stochastic 𝜋0: $S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
– Learn new policy 𝜋 ∈ 𝐻
• Training
$\hat{\pi} := \operatorname{argmax}_{\pi \in H} \sum_{i=1}^{n} \frac{\pi(y_i | x_i)}{\pi_0(y_i | x_i)} \delta_i$
[Figure: logging policy 𝜋0(𝑌|𝑥) and candidate policies 𝜋1(𝑌|𝑥), …, 𝜋237(𝑌|𝑥)]
[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
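As a toy sketch of this training step, one can search a small candidate set H for the policy with the highest IPS estimate, reusing `ips_estimate` (and numpy) from the earlier sketch, so the log is again assumed to contain the logged propensities:

```python
def maximize_ips_over_policies(log, candidate_policies):
    """Pick the policy in a finite H that maximizes the IPS training
    objective. With a rich H this naive selection can overfit, which is
    what the variance-regularized CRM objective below addresses."""
    scores = [ips_estimate(log, pi) for pi in candidate_policies]
    best = int(np.argmax(scores))
    return candidate_policies[best], scores[best]
```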
Learning Theory for BLBF
Theorem [Generalization Error Bound]
For any policy space 𝐻 with capacity 𝐶, and for all 𝜋 ∈ 𝐻, with probability 1 − 𝜂:
$U(\pi) \;\geq\; \hat{U}(\pi) \;-\; O\!\left(\sqrt{\widehat{Var}(\hat{U}(\pi))/n}\right) \;-\; O(C)$
The three terms correspond to the unbiased estimator, overfitting due to variance, and overfitting due to capacity.
→ The bound accounts for the fact that the variance of the utility estimator can vary greatly between different 𝜋 ∈ 𝐻.
[Swaminathan & Joachims, 2015]
Counterfactual Risk Minimization
Constructive principle for learning algorithms
→ Maximize the learning-theoretic bound
$\pi^{crm} = \operatorname{argmax}_{\pi \in H}\; \hat{U}(\pi) \;-\; \lambda_1 \sqrt{\widehat{Var}(\hat{U}(\pi))/n} \;-\; \lambda_2\, C(H)$
The three terms are the training error/utility (IPS estimator), the variance regularizer, and the capacity regularizer.
[Swaminathan & Joachims, 2015]
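A minimal sketch of evaluating this CRM objective for one policy, again assuming a log of (x, y, δ, p) tuples as in the IPS sketch; λ1, λ2 and the capacity term are illustrative placeholders:

```python
def crm_objective(log, pi, lam1=1.0, lam2=0.0, capacity=0.0):
    """CRM objective value for one policy: IPS utility minus an empirical
    variance penalty and a capacity term."""
    r = np.array([pi(x)[y] / p * delta for x, y, delta, p in log])
    n = len(r)
    return float(r.mean() - lam1 * np.sqrt(r.var(ddof=1) / n) - lam2 * capacity)
```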
Outline
Batch Learning from Bandit Feedback (BLBF)
$S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
→ Find new policy 𝜋 that selects 𝑦 with better 𝛿
Learning Principle for BLBF
Hypothesis Space, Risk, Empirical Risk, and Overfitting
Learning Principle: Counterfactual Risk Minimization
Learning Algorithms for BLBF
POEM: Bandit training of CRF policies for structured outputs
BanditNet: Bandit training of deep network policies
Learning from Logged Interventions: BLBF and beyond
POEM: Policy Space
Policy space
$\pi_w(y | x) = \frac{1}{Z(x)} \exp\!\left(w \cdot \Phi(x, y)\right)$
with
– 𝑤: parameter vector to be learned
– Φ(𝑥, 𝑦): joint feature map between input and output
– 𝑍(𝑥): partition function (i.e. normalizer)
Note: same form as a CRF or Structural SVM
POEM: Learning Method
Policy Optimizer for Exponential Models (POEM)
– Data: $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
– Policy space: $\pi_w(y | x) = \exp(w \cdot \Phi(x, y)) / Z(x)$
– Training objective (IPS estimator with variance and capacity regularization):
$w = \operatorname{argmax}_{w \in \Re^N}\; \hat{U}(\pi_w) \;-\; \lambda_1 \sqrt{\widehat{Var}(\hat{U}(\pi_w))/n} \;-\; \lambda_2 \|w\|^2$
[Swaminathan & Joachims, 2015]
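Below is a simplified numpy sketch of training such a softmax-linear policy by plain gradient ascent on the variance-regularized IPS objective. It is only a sketch under assumed interfaces (`log` holds (x, y, δ, p) tuples, `feat(x, y)` computes Φ(x, y) of length `dim`); the actual POEM algorithm additionally uses techniques such as clipping and an efficient majorization scheme described in the paper.

```python
import numpy as np

def poem_train(log, n_actions, dim, feat, lam1=0.1, lam2=0.01,
               lr=0.1, epochs=200):
    """Gradient ascent on U_hat(pi_w) - lam1*sqrt(Var_hat/n) - lam2*||w||^2
    for pi_w(y|x) proportional to exp(w . Phi(x, y))."""
    w = np.zeros(dim)
    n = len(log)
    for _ in range(epochs):
        r = np.zeros(n)              # IPS terms r_i = pi_w(y_i|x_i)/p_i * delta_i
        g = np.zeros((n, dim))       # gradients of each r_i w.r.t. w
        for i, (x, y, delta, p) in enumerate(log):
            Phi = np.stack([feat(x, a) for a in range(n_actions)])
            s = Phi @ w
            pi = np.exp(s - s.max())
            pi /= pi.sum()
            r[i] = pi[y] / p * delta
            # d r_i / dw = r_i * (Phi(x, y) - E_{a ~ pi_w}[Phi(x, a)])
            g[i] = r[i] * (Phi[y] - pi @ Phi)
        r_bar = r.mean()
        g_bar = g.mean(axis=0)
        var = r.var(ddof=1) + 1e-12
        dvar = 2.0 / (n - 1) * ((r - r_bar)[:, None] * (g - g_bar)).sum(axis=0)
        dpen = dvar / (2.0 * n * np.sqrt(var / n))   # d/dw sqrt(var/n)
        grad = g_bar - lam1 * dpen - 2.0 * lam2 * w
        w += lr * grad
    return w
```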
Learning from Logged Interventions
Every time a system places an ad, presents a search ranking, or makes a recommendation, we can think about this as an intervention for which we can observe the user's response (e.g. click, dwell time, purchase). Such logged intervention data is actually one of the most plentiful types of data available, as it can be recorded from a variety of …

POEM: Text Classification
Data: Reuters text classification
– $S^* = \{(x_1, y_1^*), \ldots, (x_m, y_m^*)\}$
– Label vectors $y^* = (y_1, y_2, y_3, y_4)$
Bandit feedback generation:
– Draw document 𝑥𝑖
– Pick 𝑦𝑖 via logging policy 𝜋0(𝑌|𝑥𝑖)
– Observe loss 𝛿𝑖 = Hamming(𝑦𝑖, 𝑦𝑖*)
– Example: 𝑦𝑖 = (1, 0, 1, 0), 𝑝𝑖 = 0.3, 𝛿𝑖 = 2
→ $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
Results: [Figure: Hamming loss vs. amount of bandit training data (in epochs) for the 𝜋0 logging policy, POEM, and the supervised CRF]
[Joachims et al., 2017]
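The "supervised to bandit" conversion above can be sketched in a few lines; the logging-policy interface (returning candidate label vectors and their probabilities) and the random seed are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming(a, b):
    """Hamming loss between two binary label vectors."""
    return int(np.sum(np.array(a) != np.array(b)))

def supervised_to_bandit(dataset, logging_policy):
    """Convert supervised examples (x, y_star) into bandit feedback
    (x, y, delta, p): sample y from the logging policy pi_0, record its
    propensity p = pi_0(y|x), and reveal only the loss of that y."""
    log = []
    for x, y_star in dataset:
        candidates, probs = logging_policy(x)   # label vectors and pi_0(Y|x)
        idx = rng.choice(len(candidates), p=probs)
        y = candidates[idx]
        log.append((x, y, hamming(y, y_star), probs[idx]))
    return log
```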
POEM: Real-World Application
Task: given query 𝑥
– Select answer 𝑦 from 𝑌(𝑥) (or abstain)
– Maximize the number of correct minus incorrect answers
Existing System:
– Small supervised set $D = \{(x_1, y_1^*), \ldots\}$
– Logistic regression scorer $s(y | x) = w \cdot \Phi(x, y)$
– Select $y = \operatorname{argmax}_{y \in Y(x)} s(y | x)$
Bandit Data:
– Field $\pi_0(y | x) = \mathrm{softmax}_\lambda(s(y | x))$
– Tune 𝜆 for desired stochasticity
– User feedback 𝛿 ∈ {−1, 0, 1}
→ $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
Results:
– Learn new 𝜋(𝑦|𝑥) via POEM on 𝑆
– Size of 𝑆: about 5,000
→ Improvement: 30% over the existing system
[Swaminathan & Joachims, 2015]
Outline
Batch Learning from Bandit Feedback (BLBF)
$S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
→ Find new policy 𝜋 that selects 𝑦 with better 𝛿
Learning Principle for BLBF
Hypothesis Space, Risk, Empirical Risk, and Overfitting
Learning Principle: Counterfactual Risk Minimization
Learning Algorithms for BLBF
POEM: Bandit training of CRF policies for structured outputs
BanditNet: Bandit training of deep network policies
Learning from Logged Interventions: BLBF and beyond
BanditNet: Policy Space
Policy space
$\pi_w(y | x) = \frac{1}{Z(x)} \exp\!\left(\mathrm{DeepNet}(x, y \mid w)\right)$
with
– 𝑤: parameter tensors to be learned
– 𝑍(𝑥): partition function
Note: same form as Deep Net with softmax output
[Joachims et al., 2017]
BanditNet: Learning Method
• Deep networks with bandit feedback (BanditNet):
– Data: $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
– Hypotheses: $\pi_w(y | x) = \exp(\mathrm{DeepNet}(x, y \mid w)) / Z(x)$
– Training objective (self-normalized IPS estimator with variance and capacity regularization):
$w = \operatorname{argmax}_{w \in \Re^N}\; \hat{U}_{\mathrm{SNIPS}}(\pi_w) \;-\; \lambda_1 \sqrt{\widehat{Var}(\hat{U}_{\mathrm{SNIPS}}(\pi_w))/n} \;-\; \lambda_2 \|w\|^2$
[Joachims et al., 2018]
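A minimal sketch of the self-normalized IPS (SNIPS) estimator that this objective builds on, under the same logging assumptions as the earlier IPS sketch; unlike plain IPS, dividing by the summed importance weights makes the estimate invariant to additive shifts of the reward, which is what makes it better behaved when training deep networks on bandit feedback:

```python
import numpy as np

def snips_estimate(log, pi):
    """Self-normalized IPS estimate of U(pi): divide the importance-weighted
    reward sum by the sum of importance weights instead of by n.
    `log` holds (x, y, delta, p) tuples with p = pi_0(y|x)."""
    w = np.array([pi(x)[y] / p for x, y, _, p in log])
    d = np.array([delta for _, _, delta, _ in log])
    return float(np.sum(w * d) / np.sum(w))
```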
BanditNet: Object Recognition
• Data: CIFAR-10
– $S^* = \{(x_1, y_1^*), \ldots, (x_m, y_m^*)\}$
– ResNet20 [He et al., 2016]
• Bandit feedback generation:
– Draw image 𝑥𝑖
– Pick 𝑦𝑖 via logging policy 𝜋0(𝑌|𝑥𝑖)
– Observe loss $\delta_i = [y_i \neq y_i^*]$
– Example: 𝑦𝑖 = dog, 𝑝𝑖 = 0.3, 𝛿𝑖 = 1
→ $S = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$
• Results
[Beygelzimer & Langford, 2009] [Joachims et al., 2017]
Outline
Batch Learning from Bandit Feedback (BLBF)
$S = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\}$
→ Find new policy 𝜋 that selects 𝑦 with better 𝛿
Learning Principle for BLBF
Hypothesis Space, Risk, Empirical Risk, and Overfitting
Learning Principle: Counterfactual Risk Minimization
Learning Algorithms for BLBF
POEM: Bandit training of CRF policies for structured outputs
BanditNet: Bandit training of deep network policies
Learning from Biased Feedback: BLBF and beyond
Unbiased Learning-to-Rank from Clicks
Idea: Exploit randomness of users
– SVM PropRank
– Deep PropDCG
– Propensity Estimation
[Figure: queries 𝑥𝑖 ∼ 𝑃(𝑋) are answered by the deployed ranker ȳ𝑖 = 𝜋0(𝑥𝑖); the presented rankings (results A–G) and the observed clicks are fed to a learning algorithm that produces a new ranker 𝜋(𝑥), which should perform better than 𝜋0(𝑥)]
[Joachims et al., 2017] [Agarwal et al., 2018]
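As a sketch of the counterfactual learning-to-rank idea (in the spirit of SVM PropRank / PropDCG), clicked documents can be weighted by the inverse of the estimated probability that their position was examined; the data layout and position-based examination model below are assumptions of the sketch, not the exact formulation from the papers:

```python
import numpy as np

def ips_weighted_rank_loss(clicks, scores, exam_prop):
    """Propensity-weighted rank objective (lower is better): for each click,
    count how many candidates the new ranker scores above the clicked
    document, weighted by 1 / P(position examined).
    `clicks`: list of (query_id, doc_index, presented_position);
    `scores[q]`: the new ranker's scores for query q's candidates;
    `exam_prop[k]`: estimated examination propensity of position k."""
    total = 0.0
    for q, d, k in clicks:
        s = scores[q]
        rank_of_d = 1.0 + float(np.sum(s > s[d]))    # rank of the clicked doc
        total += rank_of_d / exam_prop[k]
    return total / max(len(clicks), 1)
```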
Matrix Factorization
Missing data and selection bias
• System and user generated
• Unbiased matrix factorization
[Figure: a true rating matrix Y* (users × movies, with horror, romance, and comedy lovers rating horror, romance, and comedy movies) next to the partially revealed observed rating matrix Ỹ; the task is to predict Ŷ / a policy from the biased observations]
[Schnabel et al., 2016]
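A minimal sketch of propensity-weighted ("unbiased") matrix factorization in the spirit of [Schnabel et al., 2016]: each observed rating's squared error is weighted by the inverse probability that the entry was observed. The SGD loop, hyperparameters, and data layout are illustrative assumptions:

```python
import numpy as np

def unbiased_mf(ratings, prop, n_users, n_items, k=8, lam=0.05,
                lr=0.01, epochs=50, seed=0):
    """IPS-weighted matrix factorization: minimize the sum over observed
    (u, i) of (1 / prop[u, i]) * (r_ui - U_u . V_i)^2 plus L2 regularization.
    `ratings` is a list of (u, i, r) for observed entries; `prop[u, i]` is
    the estimated propensity that entry (u, i) was observed."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            w = 1.0 / prop[u, i]
            err = r - U[u] @ V[i]
            du = w * err * V[i] - lam * U[u]   # gradient step on the weighted loss
            dv = w * err * U[u] - lam * V[i]
            U[u] += lr * du
            V[i] += lr * dv
    return U, V
```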
Conclusions
Learning with Biased Feedback Data
– Evaluation: Offline A/B tests
– Learning: Learn new 𝜋 that selects 𝑦 with better 𝛿
→ Causal and Counterfactual Inference
Tutorial, Software, Papers, Data:
www.joachims.org