6. Matrix Game Social Dilemmas
■ Three canonical types, distinguished by the incentive to defect:
– Chicken: when the opponent cooperates, it pays to unilaterally defect
– Stag Hunt: when the opponent defects, it is best to defect as well
– Prisoner's Dilemma: defecting is best regardless of the opponent's strategy
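For reference, these three cases can be stated as the standard payoff inequalities for matrix-game social dilemmas, where R, P, S, T are the payoffs for mutual cooperation, mutual defection, cooperating against a defector, and defecting against a cooperator. The exact formulation below is a sketch of the usual greed/fear conditions, not a quotation from the slides:

\begin{align*}
&\text{Social dilemma: } R > P,\quad R > S,\quad 2R > T + S,\\
&\quad \text{together with greed } (T > R) \text{ and/or fear } (P > S).\\
&\text{Greed only} \;\Rightarrow\; \text{Chicken: } T > R > S > P\\
&\text{Fear only} \;\Rightarrow\; \text{Stag Hunt: } R > T > P > S\\
&\text{Greed and fear} \;\Rightarrow\; \text{Prisoner's Dilemma: } T > R > P > S
\end{align*}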
7. Matrix Game Social Dilemmas and the Real World
■ Matrix games ignore several aspects of real-world social dilemmas:
– they unfold over time (there is a temporal dimension)
– cooperativeness is a graded quantity, not a binary choice
– players make decisions with only partial information about the state and the other players
→ The paper proposes Sequential Social Dilemmas (SSDs) to capture these aspects
14. Setting
■ Gathering, Wolfpack
– Observation: RGB view of a 30×10 grid
– Actions: move in four directions, rotate, stand still, fire a beam (8 actions in total)
■ Policies are learned with a Deep Q-Network (DQN); a sketch of such a learner follows below
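As a rough illustration of this setup, here is a minimal DQN-style Q-network with ε-greedy action selection. It assumes an RGB observation of shape 3×10×30 and 8 discrete actions; PyTorch and all layer sizes are my own assumptions for illustration, not the paper's architecture:

import random
import torch
import torch.nn as nn

N_ACTIONS = 8  # move x4, rotate x2, stand still, fire beam

class QNetwork(nn.Module):
    """Minimal convolutional Q-network for an RGB grid observation."""
    def __init__(self, in_shape=(3, 10, 30), n_actions=N_ACTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_shape[0], 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),                    # 32 * 5 * 15 = 2400 for a 10x30 grid
            nn.Linear(32 * 5 * 15, 128), nn.ReLU(),
            nn.Linear(128, n_actions),       # one Q-value per action
        )

    def forward(self, obs):                  # obs: (batch, 3, 10, 30) float tensor
        return self.net(obs)

def epsilon_greedy(q_net, obs, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(obs.unsqueeze(0)).argmax(dim=1).item())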
[Extract from the paper, reconstructed:] The long-term payoff to player $i$ when the joint policy $\vec{\pi} = (\pi_1, \pi_2)$ is followed starting from state $s_0 \in S$:

$$V_i^{\vec{\pi}}(s_0) = \mathbb{E}_{\vec{a}_t \sim \vec{\pi}(s_t),\, s_{t+1} \sim \mathcal{T}(s_t, \vec{a}_t)}\left[\sum_{t=0}^{\infty} \gamma^t\, r_i(s_t, \vec{a}_t)\right] \tag{5}$$

Matrix games are the special case of two-player perfectly observable Markov games obtained when $|S| = 1$; specify $A^1 = A^2 = \{C, D\}$, where $C$ and $D$ are called cooperate and defect respectively. The payoffs $R(s)$, $P(s)$, $S(s)$, $T(s)$ that determine when the game is a social dilemma are defined as follows:

$$R(s) = V_1^{\pi^C, \pi^C}(s) = V_2^{\pi^C, \pi^C}(s) \tag{6}$$
$$P(s) = V_1^{\pi^D, \pi^D}(s) = V_2^{\pi^D, \pi^D}(s) \tag{7}$$
$$S(s) = V_1^{\pi^C, \pi^D}(s) = V_2^{\pi^D, \pi^C}(s) \tag{8}$$

(with $T(s)$ defined analogously, swapping the roles of $C$ and $D$).

[Figure 3: Left: Gathering — one player directs its beam at an apple location while the red player approaches from the south. Right: Wolfpack — the size of the agent's view relative to the map, and the region around the prey, are illustrated.]
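To make Eq. (5) concrete, here is a minimal Monte Carlo sketch that estimates each player's value under a fixed joint policy. The two-player environment API (reset/step returning per-player observations and rewards) is a hypothetical stand-in, and the infinite sum is truncated at a finite horizon:

import numpy as np

def estimate_value(env, pi1, pi2, s0, gamma=0.99, n_rollouts=100, horizon=1000):
    """Monte Carlo estimate of Eq. (5): average discounted return of each
    player under the fixed joint policy (pi1, pi2), starting from s0."""
    totals = np.zeros(2)
    for _ in range(n_rollouts):
        obs1, obs2 = env.reset(state=s0)          # hypothetical two-player API
        ret, discount = np.zeros(2), 1.0
        for _ in range(horizon):                  # truncates the infinite sum
            a1, a2 = pi1(obs1), pi2(obs2)
            (obs1, obs2), (r1, r2), done = env.step(a1, a2)
            ret += discount * np.array([r1, r2])
            discount *= gamma
            if done:
                break
        totals += ret
    return totals / n_rollouts                    # (V_1, V_2) estimates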
15. Gathering
■ A game in which players collect green apples
■ An agent hit by the beam twice is removed from the game for N_tagged frames
■ Rewards
– +1 for collecting an apple
– a collected apple respawns after N_apple frames
– no reward for hitting, or being hit by, the beam
■ The paper analyzes how the agents' degree of hostility (beam-firing frequency) changes as N_apple and N_tagged are varied; a toy sketch of these rules follows below
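A toy re-implementation of just these reward/tagging rules (not the authors' environment; positions and beam geometry are abstracted away) might look like this:

class GatheringRules:
    """Toy sketch of the Gathering reward/tagging rules described above."""
    def __init__(self, n_apple, n_tagged):
        self.n_apple, self.n_tagged = n_apple, n_tagged
        self.apple_timers = {}    # apple position -> frames until respawn
        self.hits = [0, 0]        # beam hits taken by each player
        self.removed = [0, 0]     # frames each player remains removed

    def collect(self, player, pos):
        """Collecting an apple: +1 reward; it respawns after N_apple frames."""
        self.apple_timers[pos] = self.n_apple
        return 1.0

    def beam_hit(self, target):
        """No reward for the shooter; two hits remove the target for N_tagged frames."""
        self.hits[target] += 1
        if self.hits[target] >= 2:
            self.hits[target] = 0
            self.removed[target] = self.n_tagged
        return 0.0

    def tick(self):
        """Advance respawn and removal timers by one frame."""
        self.apple_timers = {p: t - 1 for p, t in self.apple_timers.items() if t > 1}
        self.removed = [max(0, t - 1) for t in self.removed]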
16. Gathering
■ Scarce apples or a high cost of being tagged → hostile policies
■ When resources are scarce, agents tend to come into conflict
■ When resources are abundant, conflict is less likely
[Extract from the paper, reconstructed:] ...over $A_i$. Each agent updates its policy given a stored batch of experienced transitions $\{(s, a, r_i, s')_t : t = 1, \dots, T\}$ such that

$$Q_i(s, a) \leftarrow Q_i(s, a) + \alpha\left[ r_i + \gamma \max_{a' \in A_i} Q_i(s', a') - Q_i(s, a) \right]$$

This is a "growing batch" approach to reinforcement learning in the sense of [45]. However, it does not grow in an unbounded fashion. Rather, old data is discarded so the batch can be constantly refreshed with new data reflecting more recent transitions. We compared batch sizes of 1e5 (our default) and 1e6 in our experiments (see Sect. 5.3). The network representing the function Q is trained through gradient descent on the mean squared Bellman residual, with the expectation taken over transitions uniformly sampled from the batch (see [25]). Since the batch is constantly refreshed, the Q-network may adapt to the changing data distribution arising from the effects of learning on $\pi_1$ and $\pi_2$.
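The "growing batch" described here behaves like a bounded FIFO replay buffer with uniform sampling. A minimal sketch (capacity 1e5, matching the default batch size mentioned above; the Q-function interface is assumed):

import random
from collections import deque

class GrowingBatchBuffer:
    """Bounded buffer that discards old transitions as new ones arrive."""
    def __init__(self, capacity=int(1e5)):
        self.buffer = deque(maxlen=capacity)   # old data falls off the front

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # uniform sampling

def q_targets(batch, q_func, gamma):
    """One-step targets r + gamma * max_a' Q(s', a') for a sampled batch;
    training minimizes the mean squared Bellman residual against these."""
    return [r + gamma * max(q_func(s_next)) for (_, _, r, s_next) in batch]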
18. Gathering
■ Π_C, Π_D: the sets of trained policies from environments where N_apple/N_tagged is high or low (the extraction procedure is sketched below)
■ In the cases where a social dilemma arose, almost all of them were Prisoner's Dilemmas
[Figure 6: Summary of matrix games discovered within Gathering (left) and Wolfpack (right) through extracting empirical payoff matrices; games are classified by social dilemma type, indicated by color, on axes T − R and P − S.]
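The procedure behind Figure 6 can be sketched as follows: pair up trained policies from Π_C and Π_D, estimate the value of each pairing (Eqs. 6–8), and average over the sets. `estimate_value` is the Monte Carlo estimator sketched earlier, and the API is an assumption:

import itertools
import numpy as np

def empirical_payoffs(env, s0, Pi_C, Pi_D, estimate_value):
    """Average player-1 values over all pairings of policies drawn from the
    cooperative set Pi_C and the defecting set Pi_D (cf. Eqs. 6-8)."""
    def avg(rows, cols):
        vals = [estimate_value(env, p1, p2, s0)[0]      # player 1's value
                for p1, p2 in itertools.product(rows, cols)]
        return float(np.mean(vals))
    R = avg(Pi_C, Pi_C)   # mutual cooperation
    P = avg(Pi_D, Pi_D)   # mutual defection
    S = avg(Pi_C, Pi_D)   # cooperate against a defector
    T = avg(Pi_D, Pi_C)   # defect against a cooperator
    return R, P, S, T     # then classify with the dilemma conditions sketched earlier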
20. Wolfpack
■ A large team reward or a wide capture-reward radius → cooperative policies
■ Two distinct cooperative policies emerged:
– find each other first → move together and then capture the prey
– find the prey first → wait until the partner arrives
[Extract from the paper, continued:] In order to make learning in SSDs tractable, we make the extra assumption that each individual agent's learning depends on the other agent's learning only via the (slowly) changing distribution of experience it generates. That is, the two learning agents are "independent" of one another, and each regards the other as part of the environment. From the perspective of player one, the learning of player two shows up as a non-stationary environment. The independence assumption can be seen as a particular kind of bounded rationality: agents do no recursive reasoning about one another's learning. In principle, this restriction could be dropped through the use of planning-based reinforcement learning methods like those of [24].
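The independence assumption amounts to running two single-agent learners side by side. A minimal sketch of one training step, with a hypothetical two-player env/agent API (the buffers could be the GrowingBatchBuffer sketched earlier):

def independent_learning_step(env, agents, buffers, obs, batch_size=32):
    """Each agent acts on its own observation and learns only from its own
    experience, treating the other agent as part of the environment."""
    actions = [agent.act(o) for agent, o in zip(agents, obs)]
    next_obs, rewards, done = env.step(*actions)     # hypothetical API
    for i, (agent, buf) in enumerate(zip(agents, buffers)):
        buf.add(obs[i], actions[i], rewards[i], next_obs[i])  # own view only
        agent.update(buf.sample(batch_size))   # no recursive reasoning about the other
    return next_obs, done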
[Figure 4: Social outcomes are influenced by environment parameters. Top: Gathering — beam-use rate (aggressiveness) as a function of the re-spawn time of apples N_apple (abundance) and the re-spawn time of agents N_tagged (conflict cost).]
21. Wolfpack
■ Π_C, Π_D: the sets of trained policies from environments where r_team × r_radius is high or low, respectively
■ All three dilemma types arose: Chicken, Stag Hunt, and Prisoner's Dilemma
[Figure 6, repeated: classification of the extracted matrix games by social dilemma type, on axes T − R and P − S.]
22. Agent parameters influencing the emergence of defection
■ Discount factor
– larger values make agents more prone to defection
– Gathering: eliminating the other player makes it easier to obtain rewards later on