UCB

Challenge Description
• RL serves the option that aims to maximize the reward
(e.g. if we measure clicks, we wish to serve the option that is most
likely to be clicked).
Problem: After a certain duration, one option becomes the strongest and
will always be served.

Epsilon-Greedy
• What is epsilon greedy?
• Causata’s implementation.
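
As a rough illustration of the first question, here is a minimal sketch of the general textbook epsilon-greedy rule. It is not Causata's implementation, which these slides do not describe; the function name `epsilon_greedy_choice` and its parameters are hypothetical.

```python
import random


def epsilon_greedy_choice(rates, epsilon, rng=random):
    """Return the index of the option to serve.

    rates   -- measured reward rate of each option (assumed already known)
    epsilon -- probability of exploring instead of exploiting
    rng     -- source of randomness (hypothetical parameter, for repeatability)
    """
    if rng.random() < epsilon:
        # Explore: serve any option, chosen uniformly at random.
        return rng.randrange(len(rates))
    # Exploit: serve the option with the highest measured rate.
    return max(range(len(rates)), key=lambda i: rates[i])
```

With `epsilon = 0` the rule always serves the current best option, which is exactly the problem described in the challenge; a small positive `epsilon` keeps the other options in play.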

Multi-Armed Bandit (bandit)
• The problem:
Consider a casino with many slot machines, each with a certain
unknown pay-out rate (e.g. 0.6, 0.3, 0.4).
We aim to maximize our reward, hence we should learn the rates.
Exploration – We try the machines to learn their pay-outs
Exploitation – We assume that we have learned the rates and we take the
optimal machine
Q: How to balance between Exploration & Exploitation?
Bandit algorithms ensure that exploration will always take place

Bandit (Cont.)
• We can do A/B testing
1. Consider K machines
2. Play each of them randomly and measure the reward
3. Take the best measured rate.
• We can do UCB, which is computed from:
• Impressions
• Responses (positive responses)
• Opportunities
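
The A/B procedure in steps 1 to 3 above can be sketched as a small simulation. This is only an illustration under stated assumptions: each machine pays 1 with a fixed probability, and the names `ab_test`, `payout_rates`, `plays` and `seed` are hypothetical.

```python
import random


def ab_test(payout_rates, plays, seed=0):
    """Play K machines uniformly at random and return (best index, measured rates)."""
    rng = random.Random(seed)
    k = len(payout_rates)            # step 1: consider K machines
    impressions = [0] * k            # how many times each machine was played
    responses = [0] * k              # how many times each machine paid out
    for _ in range(plays):
        arm = rng.randrange(k)       # step 2: play a machine chosen at random
        impressions[arm] += 1
        if rng.random() < payout_rates[arm]:
            responses[arm] += 1      # ...and measure the reward
    measured = [r / n if n else 0.0 for r, n in zip(responses, impressions)]
    best = max(range(k), key=lambda i: measured[i])   # step 3: best measured rate
    return best, measured
```

Note that every play in this sketch is exploration: the measured rates are only exploited after the test ends.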

UCB – How does it work?
• We measure the pay-out rate of each option as in A/B
• Rather than taking the biggest rate, we take the biggest rate + std
• It can be used as an exploration mechanism (this is the mechanism we follow)
• It can be used in exploitation (explore, and use this mechanism while
exploiting)
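
A minimal sketch of the "rate + std" selection described above, assuming each option has a count of positive responses and a count of impressions. The names `ucb_scores` and `pick_by_ucb` are hypothetical, and giving unseen options an infinite score is an assumption made here so that the division is always defined.

```python
import math


def ucb_scores(responses, impressions):
    """Return measured rate + standard deviation of that rate, per option."""
    scores = []
    for r, n in zip(responses, impressions):
        if n == 0:
            # Assumption: an option with no impressions is tried first.
            scores.append(math.inf)
            continue
        p = r / n                                   # measured pay-out rate
        scores.append(p + math.sqrt(p * (1 - p) / n))
    return scores


def pick_by_ucb(responses, impressions):
    """Return the index of the option with the biggest rate + std."""
    scores = ucb_scores(responses, impressions)
    return max(range(len(scores)), key=lambda i: scores[i])
```

For example, with 60 responses in 100 impressions against 5 responses in 10 impressions, the second option has the lower measured rate (0.5 vs 0.6) but the higher score (about 0.658 vs 0.649), so it is the one served. Options with few impressions get a larger bonus, which is what drives exploration.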

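Chernoff/Hoeffding
• Chernoff/Hoeffding inequality
• Let $X_i \in [a_i, b_i]$ be independent random variables with $\mu_i = E[X_i]$. Then
$$P\left(\left|\sum_i (X_i - \mu_i)\right| \ge \varepsilon\right) \le 2\exp\left(\frac{-2\varepsilon^2}{\sum_i (b_i - a_i)^2}\right)$$
for every $\varepsilon > 0$.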

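Chernoff/Hoeffding (cont.)
• For UCB we take rewards in [0, 1] and apply the inequality to the mean
of the s samples of a single arm. The one-sided form is
$P(\hat{\mu}_i + \varepsilon \le \mu_i) \le \exp(-2 s \varepsilon^2)$,
where $\hat{\mu}_i$ is the measured rate of arm i and $\mu_i$ its true rate.
• We take $\varepsilon = \sqrt{2\log(t)/s}$, where t is the total number of
samples and s the number of impressions for a single arm.
• Substituting, we get
$$P\left(\hat{\mu}_i + \sqrt{2\log(t)/s} \le \mu_i\right) \le \exp(-4\log(t)) = t^{-4}$$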
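
The substitution can be checked numerically. The sketch below assumes rewards in [0, 1] and takes the confidence radius as sqrt(2 log(t) / s), the value for which the one-sided Hoeffding bound exp(-2 s ε²) equals exp(-4 log(t)). The function names are hypothetical.

```python
import math


def confidence_radius(t, s):
    """Radius sqrt(2 log(t) / s) for t total samples and s impressions of one arm."""
    return math.sqrt(2 * math.log(t) / s)


def hoeffding_tail(s, eps):
    """One-sided Hoeffding bound for the mean of s samples taking values in [0, 1]."""
    return math.exp(-2 * s * eps ** 2)


t, s = 100, 20
bound = hoeffding_tail(s, confidence_radius(t, s))
# The impressions s cancel out, leaving t to the power -4.
print(bound, t ** -4)
```

The bound depends only on t, so the chance that the true rate lies above the upper confidence value shrinks quickly as more samples are collected.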

Formulas
• UCB = p + sqrt((1-p) * p / impressions), where p is the measured rate
• Auer improvement
UCB = p + sqrt((1-p) * p * log(opportunities) / impressions)
• Next improvement
• UCB = p + sqrt((1-p) * p * log(opportunities) / impressions)
+ log(opportunities) / impressions
• Note that this correction term may go to infinity, thus we have a window.
• Further reading – Chernoff/Hoeffding inequality
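
The three formulas above can be written side by side as follows. This is a direct transcription rather than a production implementation: the function names are hypothetical, `impressions` is assumed to be positive, and `opportunities` is assumed to be at least 1 so that the logarithm is non-negative. The window mentioned in the note is not modelled.

```python
import math


def ucb_std(p, impressions):
    """Measured rate plus the standard deviation of that rate."""
    return p + math.sqrt((1 - p) * p / impressions)


def ucb_log(p, impressions, opportunities):
    """Variant with the log(opportunities) factor inside the square root."""
    return p + math.sqrt((1 - p) * p * math.log(opportunities) / impressions)


def ucb_log_corrected(p, impressions, opportunities):
    """Previous variant plus the log(opportunities) / impressions correction term."""
    return ucb_log(p, impressions, opportunities) + math.log(opportunities) / impressions
```

The correction term grows when `opportunities` keeps increasing while `impressions` stays fixed, which is the situation the note about the window refers to.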

Where it is used?
• In Causata’s engine – for exploration, and solely for exploration
• One can keep the current exploration mechanism and use UCB for
exploitation (i.e. rather than taking the best mean, take the best UCB)
