This document discusses the multi-armed bandit problem and potential algorithms to improve the DARTS system at Intuit. It begins with an overview of multi-armed bandit problems and terminology. It then covers several non-contextual bandit algorithms like epsilon-greedy, Thompson sampling, and UCB. Next, it discusses exploiting context, including user context through clustering and arm context through linear models. It compares the exploration-exploitation tradeoff, total regret, and robustness to batch updates of different algorithms. The document aims to explore using multi-armed bandit algorithms to optimize content delivery in DARTS.
1. Multi-armed Bandit Problem: Potential Improvement for DARTS
CTG Data Science Lab
Aniruddha Bhargava, Yika Yujia Luo
August 17, 2016
2. Agenda
1. Problem Overview
2. Algorithms
   • Non-contextual cases
   • Contextual cases
3. Industry Review
4. Advanced Topics
4. When do we run into the Multi-armed Bandit Problem (MAB)?
Examples: gambling, research funding, clinical trials, content management.
5. What is the Multi-armed Bandit Problem (MAB)?
Goal: pick the best restaurant efficiently.
Logistics: select a restaurant for each person, who leaves you a tip afterwards.
How?
(Figure: three restaurants with observed tips such as $1, $8, and $10, and running average tips of $2, $7, and $6.)
6. MAB Terminology
Arm: a restaurant.
User: a person sent to a restaurant.
Reward: the tip.
Expected reward: the average tip in the long run.
Exploration: a learning process over people's preferences; it always involves a certain degree of randomness.
Exploitation: using the current, reliable knowledge of the arms to select a restaurant.
Policy: the strategy you use to select restaurants.
Regret: the expected tip loss from sending a person to a restaurant that is not the best.
Total cumulative regret: the total tips you lose -- the standard performance measure for bandit algorithms.
(Example: with expected tips of $1, $8, and $10, sending someone to the $1 restaurant incurs $9 of regret, the $8 restaurant $2, and the best restaurant $0; two $9 misses plus one $2 miss plus one perfect choice give a total regret of $20.)
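A tiny Python sketch of this bookkeeping (the restaurant names and the tip-to-name assignment are made up for illustration; the dollar amounts follow the slide's example):

```python
# Expected tips per restaurant (illustrative assignment of the slide's $1/$8/$10).
expected_tips = {"McDonald's": 1.0, "Chili's": 8.0, "Subway": 10.0}
best = max(expected_tips.values())

def total_regret(visits):
    """Total cumulative regret: best expected tip minus the chosen one, summed."""
    return sum(best - expected_tips[r] for r in visits)

print(total_regret(["McDonald's", "McDonald's", "Chili's", "Subway"]))  # 9+9+2+0 = 20.0
```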
7. MAB Big Picture
MAB sits at the intersection of decision making and optimization:
Decision making: choose the best product -- here, find the best restaurant to go to.
Optimization: minimize total regret -- avoid sending people to bad restaurants as much as possible.
8. Algorithms (Non-contextual Cases)
"Anytime you are faced with the problem of both exploring and exploiting a search space, you have a bandit problem. Any method of solving that problem is a bandit algorithm." -- Chris Stucchio
9. Non-contextual vs. Contextual
(Diagram: users and products shown without any features -- the non-contextual setting.)
Key point: although everyone has different tastes, we pick one best restaurant for everyone.
10. MAB Policies
A/B Testing
Adaptive policies:
  • ε-greedy
  • Thompson Sampling
  • Upper Confidence Bound (UCB)
There are more bandit algorithms...
11. A/B Testing
During the test, each person is assigned to a restaurant at random (100% exploration).
After the test, every person is sent to the measured winner (100% exploitation).
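Viewed as a bandit policy, A/B testing is "explore uniformly, then commit". A minimal Python sketch, where `pull`, the arm names, and the round budgets are illustrative assumptions:

```python
import random

def ab_test(arms, pull, test_rounds, total_rounds):
    """Explore uniformly at random for test_rounds, then commit to the winner."""
    sums = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    winner = None
    for t in range(total_rounds):
        if t < test_rounds:
            arm = random.choice(arms)                 # 100% exploration
        else:
            if winner is None:                        # pick the winner once
                winner = max(arms, key=lambda a: sums[a] / max(counts[a], 1))
            arm = winner                              # 100% exploitation
        r = pull(arm)                                 # observe the user's tip
        sums[arm] += r
        counts[arm] += 1
    return winner, counts
```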
12. ε-greedy
Select (e.g. ε = 0.2): with probability 1 - ε, send person i to the restaurant with the highest average tips; with probability ε, pick a restaurant at random.
Update: record person i's feedback and update that restaurant's average tip value.
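A minimal ε-greedy sketch of that select/update loop (the `pull` reward callback is an assumed stand-in for a user leaving a tip; ε = 0.2 follows the slide):

```python
import random

def epsilon_greedy(arms, pull, rounds, eps=0.2):
    """With probability eps explore a random arm, otherwise exploit the best average."""
    sums = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for _ in range(rounds):
        if random.random() < eps:
            arm = random.choice(arms)                                   # explore
        else:
            arm = max(arms, key=lambda a: sums[a] / max(counts[a], 1))  # exploit
        r = pull(arm)
        sums[arm] += r       # update that restaurant's running average
        counts[arm] += 1
    return {a: sums[a] / max(counts[a], 1) for a in arms}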
13. Upper Confidence Bound (UCB)
Select: send person i to the restaurant with the highest upper confidence bound. With x̄_j the average tips from restaurant j, n_j the number of people sent to restaurant j, and n the total number of people, the UCB1 index is x̄_j + sqrt(2 ln n / n_j).
Update: record person i's feedback and update the upper confidence bound of that restaurant's average tips.
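A minimal UCB1 sketch (again, `pull` is an assumed reward callback):

```python
import math

def ucb1(arms, pull, rounds):
    """UCB1: try each arm once, then pick the arm with the largest
    empirical mean plus exploration bonus sqrt(2 ln n / n_j)."""
    sums = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for a in arms:                       # initialization: one pull per arm
        sums[a] += pull(a)
        counts[a] += 1
    for n in range(len(arms) + 1, rounds + 1):
        ucb = {a: sums[a] / counts[a] + math.sqrt(2 * math.log(n) / counts[a])
               for a in arms}
        arm = max(ucb, key=ucb.get)      # highest upper confidence bound
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts                        # visits concentrate on the best arm
```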
14. Thompson Sampling (Bayesian)
Sample: simulate the three restaurants' average-tip distributions (McDonald's, Subway, Chili's) and randomly draw one value from each distribution.
Select: send person i to the restaurant with the highest sampled tip.
Update: record person i's feedback and update that restaurant's average-tip distribution.
(Figure: three posterior curves over average tips in $.)
15. Thompson Sampling (Bayesian)
(Figure: two posterior plots, one with Pr(r < b) = 10% and one with Pr(r < b) = 0.01% -- as the posteriors sharpen with data, the chance of drawing a value below the benchmark b becomes negligible.)
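A minimal Thompson-sampling sketch, assuming each restaurant's average tip is modeled with a Normal posterior under a known observation variance (the prior parameters `mu0` and `tau2`, the variance `sigma2`, and the `pull` callback are all illustrative assumptions):

```python
import random

def thompson_gaussian(arms, pull, rounds, sigma2=1.0, mu0=0.0, tau2=100.0):
    """Thompson sampling with a Normal prior on each arm's mean tip
    (known observation variance sigma2 -- a simplifying assumption)."""
    stats = {a: {"n": 0, "sum": 0.0} for a in arms}
    for _ in range(rounds):
        samples = {}
        for a in arms:
            n, s = stats[a]["n"], stats[a]["sum"]
            prec = 1.0 / tau2 + n / sigma2           # posterior precision
            mean = (mu0 / tau2 + s / sigma2) / prec  # posterior mean
            samples[a] = random.gauss(mean, (1.0 / prec) ** 0.5)
        arm = max(samples, key=samples.get)          # highest sampled tip wins
        r = pull(arm)
        stats[arm]["n"] += 1                         # update that arm's posterior
        stats[arm]["sum"] += r
    return stats
```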
16. Algorithm Comparison
1. Exploration vs. Exploitation
2. Total Regret
3. Batch Update
17. Algorithm Comparison: Exploration vs. Exploitation
Key point: exploration costs money!
(Chart: exploration % over time. A/B testing explores 100% of traffic during the test and 0% afterwards; ε-greedy explores at a constant rate ε.)
18. Algorithm Comparison: Total Regret
(Charts: cumulative traffic split across restaurants M, S, and C under A/B testing, roughly 44% / 28% / 28%, versus an adaptive policy, roughly 70% / 18% / 12%. The adaptive policy concentrates traffic on the winner over time, so it accrues less total regret.)
19. Algorithm Comparison: Batch Update
(Diagram: the system asks questions of many users and stores their answers, updating the model only periodically rather than after every interaction.)
Robustness to batch updates:
  • A/B Testing: very robust
  • ε-greedy: depends
  • UCB: not robust
  • Thompson Sampling: robust
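The intuition behind this table: a deterministic index (UCB-style) gives every user in a batch the same arm until feedback arrives, while a randomized index (Thompson-style) naturally spreads a batch across plausible arms. A toy sketch with hard-coded illustrative scores:

```python
import random

def serve_batch(sample_index, arms, batch_size):
    """Choose arms for a whole batch before any feedback arrives.

    sample_index(arm) -> score; deterministic for UCB-style rules,
    randomized for Thompson-style rules.
    """
    return [max(arms, key=sample_index) for _ in range(batch_size)]

# Deterministic index: every user in the batch gets the same arm.
ucb_like = lambda a: {"M": 1.9, "S": 2.1, "C": 2.0}[a]
print(serve_batch(ucb_like, ["M", "S", "C"], 5))   # ['S', 'S', 'S', 'S', 'S']

# Randomized index: the batch is spread across plausible arms.
ts_like = lambda a: random.gauss({"M": 1.9, "S": 2.1, "C": 2.0}[a], 0.5)
print(serve_batch(ts_like, ["M", "S", "C"], 5))    # mixed, e.g. ['S', 'C', 'S', 'M', 'S']
```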
20. Algorithm Comparison: Summary
Pros
  • A/B Testing: easy to implement; good for a small number of arms; robust to batch updates
  • ε-greedy: easy to implement; if a good ε is found, lower total regret and faster best-arm identification than ε-first
  • UCB: good for a large number of arms; finds the best arm fast; low total regret
  • Thompson Sampling: good for a large number of arms; finds the best arm fast; low total regret; robust to batch updates
Cons
  • A/B Testing: high total regret
  • ε-greedy: high total regret; need to figure out a good ε
  • UCB: not robust to batch updates
  • Thompson Sampling: sensitive to statistical assumptions
21. Non-contextual vs. Contextual
Contextual: users have features (e.g. female, vegetarian, married, Latino) and products have features (e.g. burger, non-vegetarian, cheap, good service).
Key point: everyone has different tastes, so we pick the best restaurant for each person.
22. Agenda
1. Problem Overview
2. Algorithms
   • Non-contextual cases
   • Contextual cases
3. Industry Review
4. Advanced Topics
24. What do we mean by context?
User side:
  • Likes spicy food, refined tastes, plays violin, male, ...
  • From Wisconsin, likes German food, likes football, male, ...
  • Student, doesn't like seafood, allergic to cats, female, ...
  • Chief of AFC, watches shows on competitive eating, female, ...
Arm side:
  • Tex-Mex style, sit-down dining, founded in 1975, ...
  • Serves sandwiches, has veggie options, founded in 1965, ...
  • Breakfast, lunch, and dinner, cheap, founded in 1940, ...
User Context
[Figure: average reward over time. The non-contextual policy plateaus at the best possible reward without context; the contextual (user) policy climbs to the higher best possible reward with context.]
Arm Context
[Figure: average reward over time. The contextual (arm) policy converges to the same ceiling as the non-contextual one (the best possible without user context), while the policy using both arm and user context reaches the best possible reward with user context the fastest.]
Takeaway Message
User context can increase the optimal rewards; arm context can get you there faster!
Exploiting Context
User side:
• Population segmentation, e.g. DARTS
• Clustering users
• Learning embeddings
Arm side:
• Linear models: LinUCB, Linear TS, OFUL
• Maintain an estimate of the best arm
• More data → shrink uncertainty
Exploiting User Context
Assumptions:
• Users can be represented as points in space
• Users cluster together, so points that are close are similar
• Stationarity
Exploiting User Context
[Figure series: the users (Joe, Yao, Nichola, Peter, Aniruddha, Rachel, Sophie, Yika, Vineeta, Jason, Andre, Chris, Madeline, John) plotted on a meat↔vegetarian by mild↔spicy plane.]
• Scatter: users form groups in the feature space.
• Linear: a straight boundary separates the groups.
• Quadratic: a curved boundary fits the groups better.
• Hierarchical: instead of hard boundaries, each user gets soft cluster-membership probabilities (e.g. 40% / 35% / 25%), which sharpen toward assignments like 80% / 15% / 5% or 5% / 5% / 90% as more data pins each user down.
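A brief sketch of the clustering step; KMeans, the feature axes, and the cluster count are illustrative assumptions (the deck doesn't prescribe a method):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical user features: column 0 = meat (0) .. vegetarian (1),
# column 1 = mild (0) .. spicy (1).
rng = np.random.default_rng(0)
users = rng.random((100, 2))

# Hard-assign users to 3 clusters; each cluster then runs its own
# non-contextual bandit (e.g. the Thompson sampler sketched earlier).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(users)
print(np.bincount(labels))  # users per cluster
```

A hierarchical or soft variant would replace the hard labels with per-cluster membership probabilities, as in the figure above.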
Exploiting Arm Context
Linear models: look only at arm context, no user context.
Assumptions:
• We can represent arms as vectors.
• Rewards are a noisy version of the inner product.
• Stationarity.
Methods include:
• Linear UCB
• Linear Thompson Sampling
• OFUL (Optimism in the Face of Uncertainty – Linear)
• … and many more.
The Math Slide
Notation: θ* is the optimal arm (the unknown parameter vector); x_t is the arm pulled at time t; r_t is the reward at time t; η_t is the noise at time t; C_t is the confidence set; λ is the ridge term; X_t is the matrix of all arms pulled up to time t.
Standard noisy linear model:
  r_t = x_tᵀ θ* + η_t
Collect all the data and write:
  r = X θ* + η
Least-squares solution:
  θ_LS = (XᵀX)⁻¹ Xᵀ r
Ridge regression:
  θ_ridge = (XᵀX + λI)⁻¹ Xᵀ r
Typical linear bandit algorithm:
  θ_0 = 0
  for t = 0, 1, 2, …
    x_t = argmax_{x ∈ C_t} xᵀ θ_t
    θ_t = (X_tᵀ X_t + λI)⁻¹ X_tᵀ r_{1:t}   (r_{1:t}: the vector of rewards up to time t)
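A compact sketch of the ridge estimate and a greedy arm choice. Note the deck's algorithm maximizes over the confidence set C_t (which is what gives LinUCB/OFUL their exploration); this simplified version just maximizes over the known arm set:

```python
import numpy as np

def ridge_estimate(X, r, lam=1.0):
    # theta = (X^T X + lam * I)^{-1} X^T r, solved without an explicit inverse.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)

def greedy_arm(arms, theta):
    # Pull the arm with the largest predicted reward x^T theta.
    return int(np.argmax(arms @ theta))
```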
Exploiting Arm Context
[Figure: the set of arms x_1, x_2, … as vectors on the meat↔vegetarian by mild↔spicy plane: Mince pie, Buffalo wings, Tofu scramble, Grilled vegetables, Ratatouille, Tandoori chicken, Jalapeño scramble, Pad Thai, Penne arrabiata. θ* marks the optimal arm.]
Exploiting Arm Context
[Figure: Buffalo wings is the next arm chosen; θ is the angle between it and the optimal arm θ*.]
The reward (= cos(θ)) is small, but we can still infer information about other arms!
Exploiting Arm Context
[Figure: after the first pull we form an estimate θ_1 of the optimal arm, surrounded by a confidence set C_1, the region of uncertainty.]
Exploiting Arm Context
We've already homed in on a pretty good choice.
[Figure: the next arm chosen, x_2, lies close to the current estimate of the optimal arm; the region of uncertainty has shrunk.]
Exploiting Arm Context
And the process continues …
[Figure: a new estimate θ_2 with a smaller confidence set C_2 around the optimal arm.]
Some Caveats
• Big assumption that we know good features.
• Finding features takes a lot of work.
• Few arms, many people → learn an embedding of the arms.
• Few people, many arms → featurize, then use linear bandits.
• Linear models are a naive assumption; see kernel methods.
Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
Washington Post
Used Upper Confidence Bound (UCB) to pick headlines and photos.
Google Experiments
Used Thompson Sampling (TS).
Updated models twice a day.
Two metrics used to gauge the end of an experiment:
• 95% confidence that an alternative is better, or …
• the "potential value remaining in the experiment"
Takeaway Message
The more arms, the higher the gain over A/B testing.
Advanced Topics
• Biasing
• Data Joining and Latency
• Non-stationarity
Bias
Each website shows two pies with different probabilities and records how many it sells:
                Website 1     Website 2
  Probability   50% / 50%     90% / 10%
  Number sold   100 / 20      100 / 20
Who did better?
Bias
• Be careful when using past data!
• Inverse Propensity Score Matching: reweight each observation by (target probability / logging probability).
• New sales estimates, on a common 50/50 showing policy:
  Website 1: 100·0.5 + 20·0.5 = 60
  Website 2: 100·0.5·(0.5/0.9) + 20·0.5·(0.5/0.1) ≈ 77.8
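The reweighting fits in a one-line helper; the function and argument names are illustrative:

```python
def ips_estimate(sales, logged_probs, target_probs):
    # Reweight each observed count by (target prob / logging prob).
    return sum(s * (pt / pl)
               for s, pl, pt in zip(sales, logged_probs, target_probs))

# Website 2's numbers from the slide, reweighted to a 50/50 policy:
print(ips_estimate([100 * 0.5, 20 * 0.5], [0.9, 0.1], [0.5, 0.5]))  # ~77.8
```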
Data Joining and Latency
[Diagram (courtesy: Microsoft MWT white paper): the system logs the context and the decision at serve time; rewards arrive later and must be joined back to them, despite latency.]
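A toy sketch of the join step, keyed by a hypothetical event id; real systems (e.g. the MWT design referenced above) do this at scale with streaming joins:

```python
def join_rewards(decisions, rewards):
    # decisions: {event_id: (context, arm_shown)}; rewards: {event_id: reward}.
    # Rewards arrive after a delay, so only events seen on both sides
    # can be used to update the model (hence batch updates).
    return [(ctx, arm, rewards[eid])
            for eid, (ctx, arm) in decisions.items() if eid in rewards]
```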
Non-Stationarity – Beer example
My yearly beer taste: stouts and porters in January; pale ales and IPAs in April; wits and lagers in July; Oktoberfests and reds in October; Christmas ales in December.
Non-Stationarity
Preferences change over time. There may be periodicity in the data; tax season is a great example.
Some solutions (see the sketch below):
• Slow changes → a system with finite memory
• Abrupt changes → subspace tracking / anomaly detection
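A minimal sketch of the finite-memory idea (the constant step size alpha is an assumption, not from the deck): a fixed step size weights recent rewards exponentially more, so the estimate tracks slow drift.

```python
def update_discounted(avg_tip, tip, alpha=0.1):
    # Constant step size: old observations decay geometrically, so the
    # running estimate follows slowly changing preferences.
    return avg_tip + alpha * (tip - avg_tip)
```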
Takeaway Message
Preferences change over time, biases are added, and data needs to be joined from different sources.
Speaker Notes
• Classic bandit problem: which machines to play, how many times to play each machine, and in which order to play them.
• Budgeted version: given a fixed budget, the problem is to allocate resources among the competing projects.
• Clinical trials: investigating the effects of different experimental treatments while minimizing patient losses.
• Compared to recommendation problems (Netflix), only one (user, arm) pair is known per round: we see that Peter went to McDonald's, whereas Netflix sees which movies Peter has watched.
On average, what a population might pay for the best single experience is lower than what each individual might pay for their own optimal experience. Think of vegetarians and meat eaters: suppose a population with 2/3 meat eaters and 1/3 vegetarians. On average, we cater to non-vegetarians, so if people are willing to spend, on average, $15, the best possible population-wide reward is $10. If we can identify the two populations and select appropriate restaurants for each, we can get $15 on average. A $5 increase!
We can also learn faster: knowing something about one arm tells us information about other arms. Think of vegetarians and meat eaters again: if we know the population says no to meat-only restaurants, then they will likely say the same about other meat-only restaurants. Learning that reduces the number of contenders for the optimal restaurant.
We maintain a continuous estimate of the best arm, but select from a discrete set of arms.
Note on the industry examples: these companies aren't adding new content, and they aren't using either kind of context.
Meta-point: in practice, teams fall on a spectrum between completely deploy-and-forget and continuous monitoring.
The "value remaining" in an experiment is the amount of increased conversion rate you could get by switching away from the champion. The whole point of experimenting is to search for this value. If you’re 100% sure that the champion is the best arm, then there is no value remaining in the experiment, and thus no point in experimenting. But if you’re only 70% sure that an arm is optimal, then there is a 30% chance that another arm is better, and we can use Bayes’ rule to work out the distribution of how much better it is.
Figure 2: how many days earlier TS ends compared to A/B/n testing (frequency out of 500 experiments).
Both websites sold 120 pies, and the same number of each pie; only the probabilities of showing the pies differ.
Data joining: we need to bring together the context (user and arm), what variation was shown (the decision), and the reward.
Latency: the delay between the user's response and when the system sees it; hence batch updates.
In reality, for a short enough period, preferences remain mostly the same. Finite memory: only infer insights from a fixed time window (good for slowly changing signals). Anomaly detection: check whether something weird happened (good for sudden changes).