3. Papers Introduced Today
• Pandey, S., Agarwal, D., Chakrabarti, D., & Josifovski, V. (2007). Bandits for Taxonomies: A Model-based Approach. In SDM.
• Chakrabarti, D., Kumar, R., Radlinski, F., & Upfal, E. (2008). Mortal Multi-Armed Bandits. In NIPS (pp. 273-280).
14. Comparison of Mean Rewards
[Figure from Pandey et al. (2007): (a) Revenue profile, plotting Revenue against Number of pulls for Multi-level, UCB1, and Round-robin; (b) MSE plot. Caption fragment: "Figure 8: UCB1 w/ shrinkage policy".]
19. Key Idea
• Examine the cumulative distribution of the rewards the arms pay out.
[Figure 2 (Chakrabarti et al., 2008): (a) Distribution of real-world ad payoffs, scaled linearly; fraction of arms vs. payoff probability (scaled). (b) Regret per time step.]
From the payoff probability and the fraction of arms that pay at each level,
↓
choose the threshold µ from the viewpoint of the exploration/exploitation tradeoff.
20. Key Idea
• Examine the cumulative distribution of the rewards the arms pay out.
[Figure 2 as on the previous slide.]
Example: suppose the threshold is µ = 0.6.
An arm whose payoff probability is 0.6 or higher → exploit that arm.
An arm whose payoff probability is below 0.6 → explore another arm.
(exploit / explore)
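As a minimal Python sketch of this rule (the function name is hypothetical, and the real algorithm derives µ from the payoff distribution rather than fixing 0.6):

    def decide(est_payoff_prob: float, mu: float = 0.6) -> str:
        # Threshold rule from the slide: exploit an arm whose (estimated)
        # payoff probability clears mu, otherwise explore another arm.
        return "exploit" if est_payoff_prob >= mu else "explore"

    assert decide(0.7) == "exploit"   # payoff probability >= 0.6
    assert decide(0.4) == "explore"   # payoff probability < 0.6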
21. Preliminary Analysis
• Examine the cumulative distribution of the rewards the arms pay out.
[Figure 2 as on slide 19.]
From the payoff probability and the fraction of arms that pay at each level, choose the threshold µ from the viewpoint of the exploration/exploitation tradeoff.
Excerpt (Chakrabarti et al., 2008): the expected payoff of an arm is drawn from a cumulative distribution F(µ) with support in [0, 1]; for X ∼ F(µ), E[X] denotes the expectation of X over F(µ). The lifetime of an arm is assumed to have a geometric distribution with parameter p, with expectation L = 1/p. The following function captures the tradeoff between exploration and exploitation in this setting and plays a major role in the analysis:

    Γ(µ) = ( E[X] + (1 − F(µ))(L − 1)·E[X | X ≥ µ] ) / ( 1 + (1 − F(µ))(L − 1) )    (1)
22. Preliminary Analysis
• Examine the cumulative distribution of the rewards the arms pay out.
[Same Figure 2, annotation, and excerpt with equation (1) as on slide 21.]
See the paper for the details!
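As an illustration of equation (1), here is a small Python sketch (my own, not the paper's procedure) that estimates Γ(µ) from samples of F and picks µ* = argmax_µ Γ(µ) on a grid:

    import numpy as np

    def gamma(mu, payoffs, L):
        # Estimate equation (1) from samples `payoffs` drawn from F,
        # with expected arm lifetime L.
        payoffs = np.asarray(payoffs, dtype=float)
        tail = payoffs[payoffs >= mu]        # arms clearing the threshold
        if tail.size == 0:                   # estimated 1 - F(mu) is zero
            return payoffs.mean()
        w = (tail.size / payoffs.size) * (L - 1.0)   # (1 - F(mu)) (L - 1)
        return (payoffs.mean() + w * tail.mean()) / (1.0 + w)

    def best_threshold(payoffs, L, grid=None):
        # mu* = argmax_mu Gamma(mu) over a grid of candidate thresholds.
        grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
        return max(grid, key=lambda mu: gamma(mu, payoffs, L))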
25. Stoch. with Early Stopping
Excerpt (Chakrabarti et al., 2008): the intuition behind STOCHASTIC is that, instead of pulling an arm once to determine its payoff, it pulls an arm n times and abandons it unless it looks promising. A variant, STOCHASTIC WITH EARLY STOPPING, abandons the arm earlier if even its maximum possible future payoff would not make it look promising. For n = O(log L/ε²), STOCHASTIC gets an expected reward per step that is ε-optimal; the details are omitted in the paper due to space constraints.

Algorithm STOCH. WITH EARLY STOPPING
  input: Distribution F(µ), expected lifetime L
  µ* ← argmax_µ Γ(µ)    [Γ is defined in (1)]
  while we keep playing
    [Play a random arm as long as necessary]
    i ← random new arm; r ← 0; d ← 0
    while d < n and n − d ≥ nµ* − r
      Pull arm i; r ← r + R(µ_i); d ← d + 1
    end while
    if r > nµ*
      [If it is good, stay with it forever]
      Pull arm i every turn until it dies
    end if
  end while

(The excerpt continues: one cannot naively use a standard multi-armed bandit (MAB) algorithm here, since MAB algorithms invest a lot of pulls on all arms.)
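A runnable Python sketch of the pseudocode above, under assumed mechanics that match the paper's setting (Bernoulli payoffs R(µ_i) ∈ {0, 1}, fresh arm means drawn from F, geometric lifetimes with mean L); the function names and the horizon parameter are my additions for illustration:

    import random

    def stoch_with_early_stopping(draw_mu, mu_star, n, L, horizon):
        # STOCHASTIC WITH EARLY STOPPING (sketch).
        #   draw_mu:  samples a fresh arm's payoff probability mu_i from F
        #   mu_star:  threshold mu* = argmax Gamma(mu)
        #   n:        exploration budget per arm, n = O(log L / eps^2)
        #   L:        expected lifetime; each pull kills the arm w.p. 1/L
        t, total = 0, 0.0
        while t < horizon:
            mu_i = draw_mu()                 # i <- random new arm
            r, d = 0.0, 0
            alive = True
            # Abandon early once even all-1 future payoffs cannot reach n*mu*.
            while alive and d < n and n - d >= n * mu_star - r:
                r += float(random.random() < mu_i)   # r <- r + R(mu_i)
                d += 1; t += 1
                alive = random.random() >= 1.0 / L   # geometric death
            total += r
            if alive and r > n * mu_star:
                # If it is good, stay with it until it dies.
                while alive and t < horizon:
                    total += float(random.random() < mu_i)
                    t += 1
                    alive = random.random() >= 1.0 / L
        return total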
26. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
Inputs: the cumulative distribution F(µ) and the expected lifetime L. From the cumulative distribution, the expected threshold µ* = argmax_µ Γ(µ) is computed. (used for exploitation)
27. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
Randomly pull an arm that has never been pulled before: arm i.
28. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
While (n − the number of times arm i has been pulled so far) ≥ (n × threshold µ* − the cumulative reward obtained from arm i), keep pulling arm i. (Here n is the per-arm exploration budget, n = O(log L/ε²).)
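A quick numeric check of this condition in Python (the numbers are only for illustration):

    def keep_pulling(n, mu_star, d, r):
        # Inner-loop test: keep pulling arm i only while the best possible
        # final total, r + (n - d), can still reach the target n * mu_star.
        return d < n and n - d >= n * mu_star - r

    assert keep_pulling(n=100, mu_star=0.5, d=50, r=10)       # 50 pulls left >= 40 still needed
    assert not keep_pulling(n=100, mu_star=0.5, d=70, r=10)   # 30 pulls left < 40 still needed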
29. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
If the rewards obtained by pulling the arm are low, the inner loop is exited early; this is the early stopping.
30. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
If the cumulative reward obtained by pulling arm i > n × threshold reward µ*, keep pulling arm i until its lifetime runs out.
31. Stoch. with Early Stopping
(Same excerpt and algorithm as slide 25.)
Otherwise, once again randomly pull an arm that has never been pulled before: arm i.
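Putting the sketches together in a hypothetical Python driver (Beta(1, 3) as the payoff distribution F and all constants are my choices, not the paper's):

    import random

    random.seed(0)
    L, n, horizon = 1000, 200, 100000
    draw_mu = lambda: random.betavariate(1, 3)     # stand-in for sampling F

    # mu* via the Gamma(mu) sketch from slide 22, estimated from samples of F.
    samples = [draw_mu() for _ in range(10000)]
    mu_star = best_threshold(samples, L)

    total = stoch_with_early_stopping(draw_mu, mu_star, n, L, horizon)
    print("mu* = %.2f, mean reward per step = %.3f" % (mu_star, total / horizon))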