2. Challenge Description
• RL – serves the option that aims to maximize the reward
(e.g. if we measure clicks, we wish to serve the option that is most
likely to be clicked)
Problem: After a certain duration, one stronger option emerges and will
always be served.
4. Multi-Armed Bandit (bandit)
• The problem:
Consider a casino with many slot machines, each with a certain
unknown pay-out rate (e.g. 0.6, 0.3, 0.4).
We aim to maximize our reward, hence we should learn the rates.
Exploration – We try the machines in order to learn their payouts
Exploitation – We assume that we have learned the rates and we take the optimal machine
Q: How do we balance exploration and exploitation?
Bandit algorithms ensure that exploration always takes place
5. Bandit (Cont.)
• We can do A/B testing
1. Consider K machines
2. Play each of them randomly and measure the reward
3. Take the machine with the best measured rate.
• We can do UCB
• Impressions
• Responses (Positive responses)
• Opportunities
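The A/B testing steps above (consider K machines, play them at random, take the best measured rate) can be sketched as a small simulation. This is only an illustration: the function name `ab_test`, the fixed seed, and the number of plays are assumptions, and the payout rates are the example values from the earlier slide.

```python
import random

def ab_test(payout_rates, plays, seed=0):
    """Play machines uniformly at random; return (best_machine, measured_rates)."""
    rng = random.Random(seed)
    k = len(payout_rates)
    impressions = [0] * k  # how many times each machine was played
    responses = [0] * k    # positive responses (wins) per machine
    for _ in range(plays):
        arm = rng.randrange(k)                # step 2: pick a machine at random
        impressions[arm] += 1
        if rng.random() < payout_rates[arm]:  # simulated pay-out
            responses[arm] += 1
    # Measured rate = responses / impressions (0.0 if a machine was never played)
    rates = [r / n if n else 0.0 for r, n in zip(responses, impressions)]
    best = max(range(k), key=lambda i: rates[i])  # step 3: best measured rate
    return best, rates

best_arm, measured = ab_test([0.6, 0.3, 0.4], plays=3000)
print(best_arm, measured)
```

Note that every machine keeps receiving roughly the same share of plays during the test, including the weak ones; that cost is what the UCB approach tries to reduce.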
6. UCB – How does it work?
• We measure the pay-out rate of each option as in A/B
• Rather than taking the biggest rate, we take the biggest rate + std
• It can be used as an exploration mechanism (we follow this mechanism)
• It can be used in exploitation (explore, and while exploiting use this
mechanism)
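A minimal sketch of the "rate + std" rule above, assuming the std is the standard error of a Bernoulli rate, sqrt(p * (1 - p) / impressions). The names `ucb_score` and `pick_option` are hypothetical, and giving an unplayed option an infinite score is an assumption made here so that it gets tried first.

```python
import math

def ucb_score(responses, impressions):
    """Measured rate plus its standard error."""
    if impressions == 0:
        return float("inf")  # assumption: an unplayed option is tried first
    p = responses / impressions
    return p + math.sqrt(p * (1 - p) / impressions)

def pick_option(stats):
    """stats: list of (responses, impressions); return the index with the top score."""
    return max(range(len(stats)), key=lambda i: ucb_score(*stats[i]))

# Option 0 has the higher measured rate (0.6 vs 0.5), but option 1 has far
# fewer impressions, so its larger std term gives it the higher score.
stats = [(60, 100), (5, 10)]
print(pick_option(stats))
```

The example shows the exploration effect: an option with little data can outrank one with a better measured rate until enough impressions shrink its std term.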
9. Chernoff-Hoeffding (cont.)
• For UCB we take:
• ε = sqrt(2 log(t) / s), where t is the total number of samples and s the number of
impressions for a single arm.
• With some manipulations we get
• P(µ̂i + sqrt(2 log(t) / s) ≤ µi) ≤ exp(-4 log(t)) = t^(-4), where µ̂i is the measured rate of arm i and µi its true rate
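The "manipulations" can be spelled out as follows. This sketch assumes rewards bounded in [0, 1], the one-sided Hoeffding inequality, and the usual UCB1 choice ε = sqrt(2 log(t) / s); the hat notation for the measured rate is introduced here for clarity.

```latex
% Hoeffding, one-sided, for s samples in [0,1] with true mean \mu_i
% and empirical mean \hat{\mu}_i:
P\left(\hat{\mu}_i + \varepsilon \le \mu_i\right) \le \exp\left(-2 s \varepsilon^2\right)

% Substitute \varepsilon = \sqrt{2\log(t)/s}, so \varepsilon^2 = 2\log(t)/s:
P\left(\hat{\mu}_i + \sqrt{\frac{2\log t}{s}} \le \mu_i\right)
  \le \exp\left(-2 s \cdot \frac{2\log t}{s}\right)
  = \exp\left(-4\log t\right)
  = t^{-4}
```

The number of impressions s cancels, so the chance that the upper bound falls below the true rate shrinks as t^(-4) regardless of how often the arm was played.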
10. Formulas
• UCB = p + sqrt((1-p) * p / impressions)
• Auer improvement
UCB = p + sqrt((1-p) * p * log(opportunities) / impressions)
• Next improvement
• UCB = p + sqrt((1-p) * p * log(opportunities) / impressions) + log(opportunities) / impressions
• Note that this correction term may go to infinity, thus we have a window
• Further reading – Chernoff/Hoeffding inequality
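The three formulas on this slide can be written as plain functions for comparison. This is a sketch only: the function names are hypothetical, `p` is taken to be the measured rate (responses / impressions), the log is assumed to be the natural log, and both `impressions > 0` and `opportunities >= 1` are assumed.

```python
import math

def ucb_basic(p, impressions):
    """Rate plus its standard error."""
    return p + math.sqrt((1 - p) * p / impressions)

def ucb_auer(p, impressions, opportunities):
    """Auer improvement: the std term is scaled by log(opportunities)."""
    return p + math.sqrt((1 - p) * p * math.log(opportunities) / impressions)

def ucb_next(p, impressions, opportunities):
    """Next improvement: adds the log(opportunities) / impressions correction."""
    correction = math.log(opportunities) / impressions
    return ucb_auer(p, impressions, opportunities) + correction

# Hypothetical values: rate 0.5 after 100 impressions out of 1000 opportunities.
for name, value in [
    ("basic", ucb_basic(0.5, 100)),
    ("auer", ucb_auer(0.5, 100, 1000)),
    ("next", ucb_next(0.5, 100, 1000)),
]:
    print(name, round(value, 4))
```

If impressions stay fixed while opportunities keep growing, the correction term in `ucb_next` grows without bound, which matches the slide's note about needing a window.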
11. Where is it used?
• In Causata’s engine – exploration, and solely exploration
• One can keep the current exploration mechanism and use UCB for
exploitation (i.e. rather than taking the best mean, take the best UCB)