Why do we run AB tests?
Before During After
Logo 1 + Logo 2
Before During After
Explore or exploit
Option that will be the best in the future
Option that has been the best in the past
Choose any option
Choose the subjectively best option
((1-e) * 100)% to subjectively best
(e/2 * 100)% to subjectively best
(e/2 * 100)% to subjectively worst
Run random simulations 1,000s of times
P(A) = 0.1
P(B) = 0.2
A: 0.01% after 100 trials
B: 0.02% after 100 trials
A: 0.01% after 100,000,000 trials
B: 0.02% after 100,000,000 trials
Upper Confidence Bound
Gotcha: rewards have to be between 0.0 and 1.0
Works best on conversion rates.
Not as well on arbitrary dollar rewards.
• Other UCB* algorithms
• LinUCB / GLM-UCB
• Exp3 and other Exp* algorithms
Why run AB tests?
Because we might have some idea that something is better, and we want to find out whether it really is.
The promise of MAB algorithms is that we can do something more like this.
Ideally we want to be able to take advantage of what we learn as we go.
So there’s this dilemma: do we exploit what we think is the best based on what we’ve seen so far,
or do we explore the other options to find out more about them?
MABs introduce this concept of regret. Roughly, it’s how often you had to try the objectively worse options in order to figure out which one is objectively best.
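To make regret a bit more concrete, here is a minimal sketch. It assumes we know the true conversion rates, which is only the case in a simulation, and it uses one common formalization: add up the gap between the best option and whatever was actually chosen. The names `regret`, `choices` and `true_rates` are made up for this example.

```python
def regret(choices, true_rates):
    """Total expected regret of a sequence of choices.

    choices:    list of option indexes that were shown, in order.
    true_rates: the real conversion rate of each option
                (known only because this is a simulation).
    """
    best = max(true_rates)
    # Each time a worse option is shown we lose (best - its rate) on average.
    return sum(best - true_rates[c] for c in choices)


# Option 1 is the true best; every showing of option 0 costs 0.1 on average.
print(regret([0, 1, 1, 0], [0.1, 0.2]))
```

An algorithm that never explores can have zero regret if it gets lucky, and very high regret if it locks onto the wrong option, which is why the comparison is usually averaged over many simulated runs.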
Let’s look at a few classic MAB algorithms.
Epsilon greedy works by randomly switching between exploration and exploitation.
The name comes from the parameter epsilon, which sets the fraction of the time we explore (choose any option at random). The rest of the time we exploit the option that looks best so far.
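A minimal sketch of that choice rule, assuming we already keep a list of observed conversion rates per option. The function name `epsilon_greedy_choice` and its parameters are made up for this example.

```python
import random


def epsilon_greedy_choice(estimates, epsilon, rng=random):
    """Return the index of the option to show next.

    estimates: observed conversion rate of each option so far.
    epsilon:   fraction of the time spent exploring (0.0 to 1.0).
    rng:       source of randomness; pass random.Random(seed) for repeatable runs.
    """
    if rng.random() < epsilon:
        # Explore: choose any option uniformly (this may also pick the current best).
        return rng.randrange(len(estimates))
    # Exploit: choose the option that looks best so far.
    return max(range(len(estimates)), key=lambda i: estimates[i])


# With two options and epsilon = 0.2, the worse-looking one is shown about 10% of the time.
print(epsilon_greedy_choice([0.1, 0.2], epsilon=0.2))
```

With two options this matches the split on the slide: (1 - e) goes to the subjectively best option through exploitation, and the exploring share e is divided evenly, e/2 each.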
So one of the weaknesses of epsilon greedy is that it doesn’t take into account how much better one variation is than another.
Softmax attempts to address this by exploring options in proportion to how good they appear to be.
Let’s say we’ve got two options and one is twice as good as the other.
We could explore in straight proportion to the observed rewards, but instead softmax does this trick with exponentials, so you can have rewards of arbitrary size and still get back values between 0 and 1. The exponentials kind of squish everything into a known range. You can even have negative rewards.
Softmax also has this concept of “temperature”. A bigger temperature means more “energy”, so more randomness and a split closer to 50/50 A/B.
A lower temperature, closer to 0, favours the best-looking option more strongly. As the temperature approaches 0 you get 100% exploitation.
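A minimal sketch of softmax selection with a temperature, assuming a list of observed average rewards per option. The names `softmax_probabilities` and `softmax_choice` are made up for this example, and the temperature has to be greater than 0 here because the code divides by it.

```python
import math
import random


def softmax_probabilities(estimates, temperature):
    """Turn observed rewards into probabilities that sum to 1."""
    scaled = [e / temperature for e in estimates]
    # Subtracting the max does not change the result but avoids overflow in exp().
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_choice(estimates, temperature, rng=random):
    """Pick an option index at random, weighted by its softmax probability."""
    probs = softmax_probabilities(estimates, temperature)
    return rng.choices(range(len(estimates)), weights=probs, k=1)[0]


# High temperature: close to 50/50. Low temperature: almost always the best option.
print(softmax_probabilities([0.1, 0.2], temperature=1000))
print(softmax_probabilities([0.1, 0.2], temperature=0.001))
```

Negative rewards work too, since the exponential of any number is positive.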
So one of the weaknesses of softmax is that it doesn’t take into account how much you know about the different options.
The idea is to keep track of how much you know about each option,
which gives you a measure of how confident you are about the different options.
There’s a whole family of UCB algorithms, but this one is called UCB1.
So we take the observed conversion rate and add the confidence bound as a kind of bonus
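A minimal sketch of that rule, using the usual UCB1 bonus of sqrt(2 * ln(total trials) / trials of this option). It assumes we track a trial count and an observed conversion rate per option; the function name `ucb1_choice` and its parameters are made up for this example.

```python
import math


def ucb1_choice(counts, means):
    """Return the index of the option to show next under UCB1.

    counts: how many times each option has been shown.
    means:  observed conversion rate of each option (between 0.0 and 1.0).
    """
    # Show every option once first, so each one has at least one observation.
    for option, n in enumerate(counts):
        if n == 0:
            return option
    total = sum(counts)
    # Observed rate plus a bonus that shrinks as an option gets more trials.
    scores = [
        mean + math.sqrt(2 * math.log(total) / n)
        for mean, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=lambda i: scores[i])


# Option 1 looks worse but has barely been tried, so its bonus is large.
print(ucb1_choice(counts=[1000, 2], means=[0.2, 0.1]))
```

Because the bonus depends on how few trials an option has had, an option we know little about keeps getting tried until its upper bound drops below the others.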
So the main gotcha of UCB1 is that the payoff has to be between 0 and 1.
There are a bunch of variations of UCB algorithms.
There are also contextual bandit algorithms, which can take into account information about visitors.
Exp3 algorithms are useful
Thanks to Lars.
Also, thanks to John Myles White. This talk is based on a presentation of his.