This document discusses bandit algorithms and their applications. It begins with an overview of multi-armed bandit problems and explores various algorithms to solve them, including naive, epsilon-greedy, and Thompson sampling algorithms. Thompson sampling is shown to have logarithmic regret compared to linear regret for other algorithms. The document then discusses contextual bandits, which incorporate context into the problem to select the best arm given context, and adversarial contextual bandits, where payoffs may change. Real-world applications of bandits include medical treatments, testing, optimization, and recommendations. Contextual bandits can be implemented using APIs, databases, and services. Bandit algorithms can solve many problems modeled as games.
LEARNING HOW TO PLAY TETRIS
[Diagram: an AGENT takes an ACTION (Rotate/Up/Down/Left/Right, etc.) in the WORLD, and receives back an OBSERVATION and a REWARD (the score since the previous action).]
THE (STOCHASTIC) MULTI-ARMED BANDIT PROBLEM
Let’s say you are in a Las Vegas casino, and there are three vacant slot machines (one-armed bandits). Pulling the arm of a machine yields either $0 or $1.
Given these three bandit machines, each with
an unknown payoff strategy, how should we
choose which arm to pull to maximise* the
expected reward over time?
* Equivalently, we want to minimise the expected
regret over time (minimise how much money we
lose).
FORMALLY(ISH)
Problem
We have three one-armed bandits. Each arm yields a reward of $1 according to some fixed unknown probability Pa, or a reward of $0 with probability 1 - Pa.
Objective
Find Pa, ∀a ∈ {1, 2, 3}
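To make the setting concrete, here is a minimal Python sketch of such a three-armed Bernoulli bandit. The payoff probabilities are invented purely for illustration; the whole point of the problem is that the player never gets to see them.

import random

# Hypothetical payoff probabilities Pa -- unknown to the player.
TRUE_P = {1: 0.02, 2: 0.21, 3: 0.17}

def pull(arm: int) -> int:
    """Pull one arm: pays $1 with probability Pa, otherwise $0."""
    return 1 if random.random() < TRUE_P[arm] else 0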
OBSERVATION #1
This is a partial information problem: when we pull an arm, we don’t get to see the rewards of the arms we didn’t pull.
(For math nerds: a bandit problem is an MDP with a single, terminal state.)
OBSERVATION #2
We need to create some approximation, or best guess, of how the world works in order to maximise our reward over time.
We need to create a model.
“All models are wrong but some are useful” —
George Box
OBSERVATION #3
Clearly, we need to try (explore) the arms of all the machines, to get a sense of which one is best.
OBSERVATION #4
Though exploration is necessary, we also need to choose the best arm as much as possible (exploit) to maximise our reward over time.
A NAÏVE ALGORITHM
Step 1: For the first N pulls (explore)
Evenly pull all the arms, keeping track of the number of pulls & wins per arm
Step 2: For the remaining M pulls (exploit)
Choose the arm with the highest expected reward
Arm 1: {trials: 100, wins: 2}  →  2/100 = 0.02
Arm 2: {trials: 100, wins: 21} →  21/100 = 0.21
Arm 3: {trials: 100, wins: 17} →  17/100 = 0.17
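As a rough sketch (not the exact code behind the numbers above), the explore-then-exploit loop can be written like this, reusing the hypothetical pull() helper and arm numbering from the earlier snippet:

def explore_then_exploit(arms, n_explore, m_exploit):
    """Naive algorithm: explore evenly for N pulls, then exploit the best arm."""
    trials = {a: 0 for a in arms}
    wins = {a: 0 for a in arms}
    # Step 1: explore -- pull all arms evenly, tracking pulls & wins per arm.
    for i in range(n_explore):
        arm = arms[i % len(arms)]
        trials[arm] += 1
        wins[arm] += pull(arm)
    # Step 2: exploit -- always play the arm with the highest expected reward.
    best = max(arms, key=lambda a: wins[a] / trials[a])
    exploit_reward = sum(pull(best) for _ in range(m_exploit))
    return best, sum(wins.values()) + exploit_reward

best_arm, total_reward = explore_then_exploit([1, 2, 3], n_explore=300, m_exploit=1000)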
EPOCH-GREEDY [1]
This explore-then-exploit scheme (explore evenly for N pulls, then exploit the empirically best arm for the remaining M pulls) is essentially the Epoch-Greedy algorithm.
[1]: http://hunch.net/~jl/projects/interactive/sidebandits/bandit.pdf
EPSILON-GREEDY / ε-GREEDY [2]
Choose a value for the hyperparameter ε
Loop forever
With probability ε: (explore)
Choose an arm uniformly at random, keep track of trials and wins
With probability 1-ε: (exploit)
Choose the arm with the highest expected reward
[2]: people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf
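A minimal sketch of this loop in Python, assuming a pull(arm) function that returns a 0/1 reward as before; the check for untried arms is our own addition to avoid dividing by zero, and the infinite loop is replaced by a fixed number of steps for the example:

import random

def epsilon_greedy(arms, epsilon, n_steps, pull):
    """ε-greedy: explore a random arm with probability ε, otherwise exploit."""
    trials = {a: 0 for a in arms}
    wins = {a: 0 for a in arms}
    for _ in range(n_steps):
        if random.random() < epsilon or not all(trials.values()):
            arm = random.choice(arms)                            # explore
        else:
            arm = max(arms, key=lambda a: wins[a] / trials[a])   # exploit
        trials[arm] += 1
        wins[arm] += pull(arm)
    return trials, wins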
OBSERVATIONS
For N = 300 and M = 1000, the Epoch-Greedy algorithm will choose something other than the best arm 200 times, or about 15% of the time.
At each turn, the ε-Greedy algorithm will choose something other than the best arm with probability P = (ε/n) × (n-1), where n is the number of arms (e.g. for ε = 0.1 and n = 3, P ≈ 0.067). It keeps exploring at this rate no matter how many time steps have elapsed.
THOMPSON SAMPLING ALGORITHM
For each arm, initialise a uniform probability distribution (prior)
Loop forever
Step 1: sample randomly from the probability distribution of each arm
Step 2: choose the arm with the highest sample value
Step 3: observe the reward for the chosen arm and update the hyperparameters of its probability distribution (posterior)
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
Assumption: rewards follow a Bernoulli process, which allows us to use a Beta distribution as a conjugate prior.
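A minimal sketch of the algorithm above for this Bernoulli/Beta case, again assuming a pull(arm) function that returns a 0/1 reward. With a Beta prior, "updating the hyperparameters" just means adding the observed wins and losses to alpha and beta:

import random

def thompson_sampling(arms, n_steps, pull):
    """Thompson sampling for Bernoulli rewards with Beta(1, 1) (uniform) priors."""
    alpha = {a: 1.0 for a in arms}   # 1 + wins per arm
    beta = {a: 1.0 for a in arms}    # 1 + losses per arm
    for _ in range(n_steps):
        # Step 1: draw one sample from each arm's current Beta posterior.
        samples = {a: random.betavariate(alpha[a], beta[a]) for a in arms}
        # Step 2: play the arm with the highest sampled value.
        arm = max(samples, key=samples.get)
        # Step 3: observe the reward and update that arm's posterior.
        reward = pull(arm)
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta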
[Figure sequence: evolving Beta distributions over [0, 1] for the blue and green arms]
First, initialise a uniform probability distribution for blue and green.
Randomly sample from both arms (blue gets the higher value in this example).
Pull blue, observe the reward (let’s say we got a reward of 1).
MULTI-ARMED BANDITS ARE GOOD…
So far, we’ve covered the “vanilla” multi-armed bandit problem. Solving this problem lets us find the globally best arm to maximise our expected reward over time.
However, depending on the problem, the globally best arm is not necessarily the best arm in every situation.
…BUT CONTEXTUAL BANDITS ARE THE HOLY GRAIL
In the contextual bandit setting, when we pull an arm and receive a reward, we also get to see a context vector associated with that reward. The objective is now to maximise the expected reward over time given the context at each time step.
Solving this problem gives us a higher reward over time in situations where the globally best arm isn’t always the best arm.
Example algorithm: Thompson Sampling with Linear Payoffs [3]
[3]: https://arxiv.org/abs/1209.3352
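A rough sketch of the flavour of [3], not a faithful reimplementation: all arms share one linear model over their context vectors, the posterior is sampled once per round, and the width parameter v is an assumption of this sketch.

import numpy as np

class LinearThompsonSampling:
    """Sketch of Thompson sampling with a linear payoff model."""

    def __init__(self, d, v=0.5):
        self.B = np.eye(d)       # posterior precision matrix
        self.f = np.zeros(d)     # running sum of reward * context
        self.v = v               # posterior width (exploration) parameter

    def choose(self, contexts):
        """contexts: dict mapping arm -> d-dimensional numpy context vector."""
        mu_hat = np.linalg.solve(self.B, self.f)
        cov = self.v ** 2 * np.linalg.inv(self.B)
        mu_tilde = np.random.multivariate_normal(mu_hat, cov)
        return max(contexts, key=lambda a: contexts[a] @ mu_tilde)

    def update(self, context, reward):
        self.B += np.outer(context, context)
        self.f += reward * context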
…ACTUALLY, ADVERSARIAL CONTEXTUAL BANDITS ARE THE HOLY GRAIL
Until now, we’ve assumed that each one-armed bandit gives a $1 reward according to some fixed but unknown probability P. In real-world scenarios, the world rarely behaves this well. Instead of a simple dice roll, the bandit may pay out differently depending on the day/hour/amount of money in the machine. It works as an adversary of sorts.
Bandit algorithms that can cope with changes in the payoff structure in a reasonable amount of time tend to work better than stochastic bandits in real-world settings.
Example algorithm: EXP4 (a.k.a. the “Monster” algorithm) [4]
[4]: https://arxiv.org/abs/1002.4058
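For reference, here is a sketch of a single round of the classic EXP4 update (exponential weighting over expert advice). The interface (an advice matrix with one probability distribution over arms per expert, rewards assumed to lie in [0, 1], and the mixing parameter gamma) is an assumption of this sketch:

import numpy as np

def exp4_step(weights, advice, gamma, pull):
    """One EXP4 round. advice: (n_experts, n_arms) matrix, each row a distribution over arms."""
    n_experts, n_arms = advice.shape
    # Mix the experts' advice by weight, then smooth with uniform exploration.
    probs = (1 - gamma) * (weights / weights.sum()) @ advice + gamma / n_arms
    arm = np.random.choice(n_arms, p=probs)
    reward = pull(arm)                       # assumed to lie in [0, 1]
    # Importance-weighted reward estimate: non-zero only for the arm we pulled.
    x_hat = np.zeros(n_arms)
    x_hat[arm] = reward / probs[arm]
    # Credit each expert with the estimated reward of the advice it gave.
    y_hat = advice @ x_hat
    new_weights = weights * np.exp(gamma * y_hat / n_arms)
    return arm, reward, new_weights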
CONTEXTUAL BANDIT ALGORITHMS ARE 1) STATE-OF-THE-ART & 2) COMPLEX
• Many algorithms are <5 years old. Some of the most interesting ones are less than a year old.
• Be prepared to read lots of academic papers and to struggle when implementing the algorithms.
• Don’t hesitate to contact the authors if there is some detail you don’t understand (thanks John Langford and Shipra Agrawal!)
• Always remember to simulate and validate bandits using real data (a simple replay-style check is sketched below).
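One simple way to validate a candidate policy offline is a replay-style check over logged (context, arm, reward) interactions, in the spirit of offline bandit evaluation; the policy(context) interface below is a placeholder, not a specific library API:

def replay_evaluate(policy, logged_data):
    """Estimate a policy's average reward by replaying logged interactions.

    Only rounds where the candidate policy picks the same arm that was
    actually shown (ideally logged under a uniformly random policy) count.
    """
    matched, total_reward = 0, 0.0
    for context, logged_arm, reward in logged_data:
        if policy(context) == logged_arm:
            matched += 1
            total_reward += reward
    return total_reward / matched if matched else 0.0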