3. Papers Introduced Today
• Pandey, S., Agarwal, D., Chakrabarti, D., & Josifovski, V. (2007). Bandits for Taxonomies: A Model-based Approach. In SDM.
• Chakrabarti, D., Kumar, R., Radlinski, F., & Upfal, E. (2008). Mortal Multi-Armed Bandits. In NIPS (pp. 273-280).
14. Comparison of Mean Rewards
[Figure from Pandey et al. (2007): (a) Revenue profile, plotting Revenue against Number of pulls for Multi-level, UCB1, and Round-robin; (b) MSE plot. Caption fragment: "Figure 8: UCB1 w/ shrinkage policy".]
19. Key Idea
• Examine the cumulative distribution of the rewards the arms pay out.
[Figure 2 (Chakrabarti et al., 2008): (a) Distribution of real-world ad payoffs, scaled linearly; fraction of arms vs. payoff probability (scaled). (b) Regret per time step.]
From the payoff probability and the fraction of arms that pay at each level,
↓
choose the threshold µ from the viewpoint of the exploration/exploitation tradeoff.
20. Key Idea
• Examine the cumulative distribution of the rewards the arms pay out.
[Figure 2 as on the previous slide.]
Example: suppose the threshold is µ = 0.6.
An arm whose payoff probability is 0.6 or higher → exploit that arm.
An arm whose payoff probability is below 0.6 → explore another arm.
(exploit / explore)
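As a minimal Python sketch of this rule (the function name is hypothetical, and the real algorithm derives µ from the payoff distribution rather than fixing 0.6):

    def decide(est_payoff_prob: float, mu: float = 0.6) -> str:
        # Threshold rule from the slide: exploit an arm whose (estimated)
        # payoff probability clears mu, otherwise explore another arm.
        return "exploit" if est_payoff_prob >= mu else "explore"

    assert decide(0.7) == "exploit"   # payoff probability >= 0.6
    assert decide(0.4) == "explore"   # payoff probability < 0.6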
21. Preliminary Analysis
• Examine the cumulative distribution of the rewards the arms pay out.
[Figure 2 as on slide 19.]
From the payoff probability and the fraction of arms that pay at each level, choose the threshold µ from the viewpoint of the exploration/exploitation tradeoff.
Excerpt (Chakrabarti et al., 2008): the expected payoff of an arm is drawn from a cumulative distribution F(µ) with support in [0, 1]; for X ∼ F(µ), E[X] denotes the expectation of X over F(µ). The lifetime of an arm is assumed to have a geometric distribution with parameter p, with expectation L = 1/p. The following function captures the tradeoff between exploration and exploitation in this setting and plays a major role in the analysis:

    Γ(µ) = ( E[X] + (1 − F(µ))(L − 1)·E[X | X ≥ µ] ) / ( 1 + (1 − F(µ))(L − 1) )    (1)
22. Preliminary Analysis
• Examine the cumulative distribution of the rewards the arms pay out.
[Same Figure 2, annotation, and excerpt with equation (1) as on slide 21.]
See the paper for the details!
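As an illustration of equation (1), here is a small Python sketch (my own, not the paper's procedure) that estimates Γ(µ) from samples of F and picks µ* = argmax_µ Γ(µ) on a grid:

    import numpy as np

    def gamma(mu, payoffs, L):
        # Estimate equation (1) from samples `payoffs` drawn from F,
        # with expected arm lifetime L.
        payoffs = np.asarray(payoffs, dtype=float)
        tail = payoffs[payoffs >= mu]        # arms clearing the threshold
        if tail.size == 0:                   # estimated 1 - F(mu) is zero
            return payoffs.mean()
        w = (tail.size / payoffs.size) * (L - 1.0)   # (1 - F(mu)) (L - 1)
        return (payoffs.mean() + w * tail.mean()) / (1.0 + w)

    def best_threshold(payoffs, L, grid=None):
        # mu* = argmax_mu Gamma(mu) over a grid of candidate thresholds.
        grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
        return max(grid, key=lambda mu: gamma(mu, payoffs, L))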
25. Stoch. with Early Stopping
Excerpt (Chakrabarti et al., 2008): the intuition behind STOCHASTIC is that, instead of pulling an arm once to determine its payoff, it pulls an arm n times and abandons it unless it looks promising. A variant, STOCHASTIC WITH EARLY STOPPING, abandons the arm earlier if even its maximum possible future payoff would not make it look promising. For n = O(log L/ε²), STOCHASTIC gets an expected reward per step that is ε-optimal; the details are omitted in the paper due to space constraints.

Algorithm STOCH. WITH EARLY STOPPING
  input: Distribution F(µ), expected lifetime L
  µ* ← argmax_µ Γ(µ)    [Γ is defined in (1)]
  while we keep playing
    [Play a random arm as long as necessary]
    i ← random new arm; r ← 0; d ← 0
    while d < n and n − d ≥ nµ* − r
      Pull arm i; r ← r + R(µ_i); d ← d + 1
    end while
    if r > nµ*
      [If it is good, stay with it forever]
      Pull arm i every turn until it dies
    end if
  end while

(The excerpt continues: one cannot naively use a standard multi-armed bandit (MAB) algorithm here, since MAB algorithms invest a lot of pulls on all arms.)
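A runnable Python sketch of the pseudocode above, under assumed mechanics that match the paper's setting (Bernoulli payoffs R(µ_i) ∈ {0, 1}, fresh arm means drawn from F, geometric lifetimes with mean L); the function names and the horizon parameter are my additions for illustration:

    import random

    def stoch_with_early_stopping(draw_mu, mu_star, n, L, horizon):
        # STOCHASTIC WITH EARLY STOPPING (sketch).
        #   draw_mu:  samples a fresh arm's payoff probability mu_i from F
        #   mu_star:  threshold mu* = argmax Gamma(mu)
        #   n:        exploration budget per arm, n = O(log L / eps^2)
        #   L:        expected lifetime; each pull kills the arm w.p. 1/L
        t, total = 0, 0.0
        while t < horizon:
            mu_i = draw_mu()                 # i <- random new arm
            r, d = 0.0, 0
            alive = True
            # Abandon early once even all-1 future payoffs cannot reach n*mu*.
            while alive and d < n and n - d >= n * mu_star - r:
                r += float(random.random() < mu_i)   # r <- r + R(mu_i)
                d += 1; t += 1
                alive = random.random() >= 1.0 / L   # geometric death
            total += r
            if alive and r > n * mu_star:
                # If it is good, stay with it until it dies.
                while alive and t < horizon:
                    total += float(random.random() < mu_i)
                    t += 1
                    alive = random.random() >= 1.0 / L
        return total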
26. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
Inputs: the cumulative distribution F(µ) and the expected lifetime L. From the cumulative distribution, the expected threshold µ* = argmax_µ Γ(µ) is computed. (used for exploitation)
27. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
Randomly pull an arm that has never been pulled before: arm i.
28. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
While (n − the number of times arm i has been pulled so far) ≥ (n × threshold µ* − the cumulative reward obtained from arm i), keep pulling arm i. (Here n is the per-arm exploration budget, n = O(log L/ε²).)
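A quick numeric check of this condition in Python (the numbers are only for illustration):

    def keep_pulling(n, mu_star, d, r):
        # Inner-loop test: keep pulling arm i only while the best possible
        # final total, r + (n - d), can still reach the target n * mu_star.
        return d < n and n - d >= n * mu_star - r

    assert keep_pulling(n=100, mu_star=0.5, d=50, r=10)       # 50 pulls left >= 40 still needed
    assert not keep_pulling(n=100, mu_star=0.5, d=70, r=10)   # 30 pulls left < 40 still needed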
29. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
If the rewards obtained by pulling the arm are low, the inner loop is exited early; this is the early stopping.
30. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
If the cumulative reward obtained by pulling arm i > n × threshold reward µ*, keep pulling arm i until its lifetime runs out.
31. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
Otherwise, once again randomly pull an arm that has never been pulled before: arm i.
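Putting the sketches together in a hypothetical Python driver (Beta(1, 3) as the payoff distribution F and all constants are my choices, not the paper's):

    import random

    random.seed(0)
    L, n, horizon = 1000, 200, 100000
    draw_mu = lambda: random.betavariate(1, 3)     # stand-in for sampling F

    # mu* via the Gamma(mu) sketch from slide 22, estimated from samples of F.
    samples = [draw_mu() for _ in range(10000)]
    mu_star = best_threshold(samples, L)

    total = stoch_with_early_stopping(draw_mu, mu_star, n, L, horizon)
    print("mu* = %.2f, mean reward per step = %.3f" % (mu_star, total / horizon))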