Diversity-Aware Reinforcement Learning and Machine Learning
Takayuki Osogami
IBM Research – Tokyo
(C) Copyright IBM Corp. 2018 1
This work was supported by JST CREST Grant Number JPMJCR1304, Japan.
Takayuki Osogami
• Senior Research Staff Member, IBM Research – Tokyo
• 1998: Graduated from the Department of Electronic Engineering, the University of Tokyo
• 2005: Ph.D., Computer Science Department, Carnegie Mellon University
• Research interests: probabilistic models, sequential decision making
(C) Copyright IBM Corp. 2018 2
Diversity
(C) Copyright IBM Corp. 2018 3
Want to see relevant and diverse items in
recommendation
(C) Copyright IBM Corp. 2018 4
“trunk” by Bing (little diversity); “trunk” by Google (some diversity)
Can have more diversity
Want to take relevant and diverse actions in
team sports
Zone defense Man-to-man defense
(C) Copyright IBM Corp. 2018 5
basketballhq.com upload.wikimedia.org
Diversity also matters in soccer
(C) Copyright IBM Corp. 2018 6
RoboCupSoccer - Simulation J League (Based on data from DataStadium)
Diversity also matters in multi-agent games
(C) Copyright IBM Corp. 2018 7
Accounting for Diversity in Reinforcement Learning
(C) Copyright IBM Corp. 2018 8
Reinforcement learning seeks to find an
optimal policy for sequential decision making
(C) Copyright IBM Corp. 2018 9
[Diagram: the agent sends an action to the environment and receives an observation and a reward]
Goal: Maximize cumulative reward
One approach to reinforcement learning is to
learn the action-value function
• Action-value function: $Q^\pi$
• $Q^\pi(s, a)$: expected cumulative reward from state $s$ when taking action $a$ and then following (optimal) policy $\pi$
• Assume (relaxed later): the Markovian state $s$ is fully observable
(C) Copyright IBM Corp. 2018 10
[Diagram: starting from state $s$, take action $a$, then collect rewards over subsequent steps; $Q^\pi(s, a)$ is their expected sum]
For most practical tasks, the action-value
function needs to be approximated
Exponentially large state space
• Combination of multiple factors
• e.g. Factor: “state” of each position
• History of observations
Exponentially large action space
• Combination of multiple “levers”
• Combination of multiple agents
(C) Copyright IBM Corp. 2018 11
Cannot deal with 𝑄(𝑠, 𝑎) in tabular form
Action-value functions approximated with
(deep) neural networks
(C) Copyright IBM Corp. 2018 12
[Diagram: a neural network takes a distributed representation of state $s$ as input (e.g., one unit per pixel) and outputs $Q^\pi(s, a_1), Q^\pi(s, a_2), \dots$; an exponentially large action space would require exponentially many output units]
Challenges in collaborative multi-agent
reinforcement learning
• Exponentially many combinations of actions (team-actions)
(C) Copyright IBM Corp. 2018 13
1. How to efficiently evaluate the value of team-actions
2. How to efficiently sample good team-actions
Consider the diversity of actions, in addition
to the value (relevance) of each action
(C) Copyright IBM Corp. 2018 14
[Diagram: the selected actions as vectors spanning a parallelepiped; diversity corresponds to its volume]
Value of the team-action of 3 agents: $\mathrm{Volume}^2 = Q(s, \{a_1, a_3, a_5\})$
Diversity can be represented by determinant
(C) Copyright IBM Corp. 2018 15
Diversity as squared volume:
$$\mathrm{Volume}^2 = Q(s, \{a_1, a_3, a_5\}) = \det\!\big(\mathbf{V}(\mathbf{x})\,\mathbf{V}(\mathbf{x})^\top\big)$$
Each row of $\mathbf{V}$ is the feature vector of one action:
$$\mathbf{V} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ \vdots \end{pmatrix} = \begin{pmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \\ v_{31} & v_{32} & v_{33} \\ v_{41} & v_{42} & v_{43} \\ v_{51} & v_{52} & v_{53} \\ \vdots & \vdots & \vdots \end{pmatrix}, \qquad \mathbf{L} = \mathbf{V}\,\mathbf{V}^\top$$
The binary vector $\mathbf{x} = (1, 0, 1, 0, 1, \dots)$ indicates the selected actions, and $\mathbf{L}(\mathbf{x}) = \mathbf{V}(\mathbf{x})\,\mathbf{V}(\mathbf{x})^\top$, where $\mathbf{V}(\mathbf{x})$ keeps only the rows with $x_i = 1$.
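As a quick illustration of "value = squared volume", here is a minimal NumPy sketch with made-up action feature vectors (not the talk's data): it computes $\det(\mathbf{V}(\mathbf{x})\,\mathbf{V}(\mathbf{x})^\top)$ for a diverse team-action and for a redundant one.

```python
import numpy as np

# Minimal sketch (NumPy, hypothetical feature vectors): the value of a
# team-action is the squared volume spanned by the selected action vectors,
# i.e., det(V(x) V(x)^T), which shrinks as the selected actions become similar.
V = np.array([
    [1.0, 0.0, 0.0],   # a1
    [0.9, 0.1, 0.0],   # a2: nearly parallel to a1
    [0.0, 1.0, 0.0],   # a3
    [0.0, 0.0, 1.0],   # a4
    [0.5, 0.5, 0.0],   # a5
])
L = V @ V.T            # kernel over all candidate actions

def team_value(x):
    """det of the principal submatrix of L for the selected actions x."""
    idx = np.flatnonzero(x)
    return np.linalg.det(L[np.ix_(idx, idx)])

diverse = np.array([1, 0, 1, 1, 0])   # a1, a3, a4: mutually orthogonal
similar = np.array([1, 1, 0, 1, 0])   # a1, a2, a4: a1 and a2 nearly parallel
print(team_value(diverse))  # = 1.0 (large volume)
print(team_value(similar))  # = 0.01 (a1 and a2 are nearly redundant)
```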
Our definition of diversity (similarity) in
multi-agent reinforcement learning
• The value of team-action is represented by determinant (volume)
(C) Copyright IBM Corp. 2018 16
We will learn the similarity between actions accordingly
• Two actions are similar ⇔ the value is low when the two actions are taken together
• Two actions are dissimilar ⇔ the value is high when the two actions are taken together
We represent the action-value function with
determinant
• $\mathbf{z}_{\le t} \equiv (\mathbf{z}_0, \mathbf{z}_1, \dots, \mathbf{z}_t)$: time series of observations
• $\mathbf{z}_{\le t} = s_t$ if the Markovian state $s_t$ is observable
• $\mathbf{x}_t \equiv \psi(a_t) \in \{0, 1\}^N$: binary features of team-action $a_t$
  • e.g., $\mathbf{x}_t$ indicates which actions are selected by the team
• $\mathbf{L}_t$: positive semi-definite matrix (kernel) that can depend on $\mathbf{z}_{\le t}$
(C) Copyright IBM Corp. 2018 17
$$Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$$
where $\mathbf{L}_t(\mathbf{x}_t)$ is the principal submatrix of $\mathbf{L}_t$ restricted to the rows and columns $i$ with $x_{t,i} = 1$.
Particular structure of the kernel for effective
learning
• $\mathbf{L}_t \equiv \mathbf{V}\,\mathbf{D}_t\,\mathbf{V}^\top$
• $\mathbf{V}$: $N \times K$ matrix ($K \le N$)
• $\mathbf{D}_t \equiv \mathrm{Diag}\big(\exp(\mathbf{d}_t(\phi))\big)$
• $\mathbf{d}_t(\phi)$: differentiable time-series model with parameter $\phi$ (e.g., RNN, LSTM, DyBM, VAR)
(C) Copyright IBM Corp. 2018 18
$$Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$$
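To make the construction concrete, here is a minimal NumPy sketch that evaluates $Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$ for the structured kernel; the sizes, the random $\mathbf{V}$, and the stand-in $\mathbf{d}_t$ are illustrative assumptions, not the talk's implementation.

```python
import numpy as np

# Minimal sketch: Q_theta(z_<=t, x_t) = alpha + log det L_t(x_t) with the
# structured kernel L_t = V Diag(exp(d_t(phi))) V^T.  Here d_t is a stand-in
# for the output of a differentiable time-series model (RNN, LSTM, DyBM, VAR, ...).
rng = np.random.default_rng(0)
N, K = 6, 4                      # N candidate actions, rank-K factorization
alpha = 0.0
V = rng.normal(size=(N, K))      # learnable N x K factor
d_t = rng.normal(size=K)         # would come from the time-series model

def q_value(x, V, d_t, alpha=0.0):
    """alpha + log det of the selected principal submatrix of V Diag(exp(d)) V^T."""
    L = V @ np.diag(np.exp(d_t)) @ V.T
    idx = np.flatnonzero(x)
    sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return alpha + logdet        # sign is +1 whenever the submatrix is nonsingular

x_t = np.array([1, 0, 1, 1, 0, 0])   # a team-action selecting 3 of the N actions
print(q_value(x_t, V, d_t, alpha))
```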
Special case of diagonal kernel reduces to the
standard approach of ignoring diversity
(C) Copyright IBM Corp. 2018 19
• $\mathbf{L}_t = \mathbf{D}_t$ (let $\mathbf{V} = \mathbf{I}$)
• $\mathbf{D}_t \equiv \mathrm{Diag}\big(\exp(\mathbf{d}_t(\phi))\big)$
• $\mathbf{d}_t(\phi)$: differentiable time-series model
$$Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t) = \alpha + \mathbf{d}_t(\phi)^\top \mathbf{x}_t = \alpha + \sum_{i:\, x_{t,i}=1} d_t(\phi)_i$$
This is the sum of the values of the selected actions, where $d_t(\phi)_i$ is the value of action $a_i$ at time $t$.
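A one-line numerical check of this special case (illustrative values only):

```python
import numpy as np

# With a diagonal kernel L_t = Diag(exp(d_t)), log det L_t(x_t) is just the sum
# of d_t over the selected actions, i.e. the standard "sum of individual action
# values" with no diversity term.
rng = np.random.default_rng(1)
d_t = rng.normal(size=5)
x_t = np.array([1, 0, 1, 0, 1])

L = np.diag(np.exp(d_t))
idx = np.flatnonzero(x_t)
logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
assert np.isclose(logdet, d_t @ x_t)   # log det equals the sum of selected d_t[i]
```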
Determinantal layer for diversity
(C) Copyright IBM Corp. 2018 20
[Diagram: a neural network maps state $s$ to $\mathbf{d}_t(\phi)$, followed by a determinantal layer]
$$Q_\theta(s, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t), \qquad \mathbf{L}_t \equiv \mathbf{V}\,\mathrm{Diag}\big(\exp(\mathbf{d}_t(\phi))\big)\,\mathbf{V}^\top$$
Prior work uses free-energy of restricted
Boltzmann machines [Sallans & Hinton 2001]
(C) Copyright IBM Corp. 2018 21
[Diagram: a restricted Boltzmann machine over state $s$ and team-action $\mathbf{x}$, with hidden units $\mathbf{h}$]
$$F(\mathbf{x}) = -\log \sum_{\tilde{\mathbf{h}}} \exp\big(-E(\mathbf{x}, \tilde{\mathbf{h}})\big), \qquad E(\mathbf{x}, \mathbf{h}) \equiv -\mathbf{b}^\top \mathbf{x} - \mathbf{c}^\top \mathbf{h} - \mathbf{x}^\top \mathbf{W} \mathbf{h}$$
With no hidden units $\mathbf{h}$, this reduces to $F(\mathbf{x}) = -\mathbf{b}^\top \mathbf{x}$, where the bias $\mathbf{b}$ represents the individual value of each action.
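For comparison, a minimal sketch of this free energy using the standard closed form for binary hidden units; the sizes and parameters are made up, and this is only an illustration of the quantity, not the setup of [Sallans & Hinton 2001].

```python
import numpy as np

# Free energy of an RBM over binary team-actions x, via the standard closed form
#   F(x) = -b^T x - sum_j log(1 + exp(c_j + (W^T x)_j)),
# which equals -log sum_h exp(-E(x, h)) for E(x, h) = -b^T x - c^T h - x^T W h.
rng = np.random.default_rng(2)
N, H = 5, 3                      # visible (action) units, hidden units
b = rng.normal(size=N)           # individual value of each action
c = rng.normal(size=H)
W = rng.normal(size=(N, H))

def free_energy(x):
    return -b @ x - np.sum(np.logaddexp(0.0, c + W.T @ x))

x = np.array([1, 0, 1, 1, 0])
print(free_energy(x))            # the free-energy approach uses -F(x) as the value estimate
```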
Learning "Diversity" via Reinforcement Learning
(C) Copyright IBM Corp. 2018 22
Reinforcement learning with SARSA
• Tabular case
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\,\mathrm{TD}_t, \quad \text{where } \mathrm{TD}_t \equiv r_{t+1} + \rho\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
• With function approximation $Q_\theta(s, a) \approx Q(s, a)$:
$$\theta \leftarrow \theta + \eta\,\mathrm{TD}_t\, \nabla_\theta Q_\theta(s_t, a_t)$$
(C) Copyright IBM Corp. 2018 23
(Both $Q(s_t, a_t)$ and $r_{t+1} + \rho\,Q(s_{t+1}, a_{t+1})$ estimate the cumulative reward from $t$.)
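For reference, a generic sketch of SARSA with linear function approximation; `env`, `features`, and `epsilon_greedy` are assumed helpers with the interfaces shown in the comments, and this is not the determinantal model itself.

```python
import numpy as np

# Minimal SARSA sketch with a linear Q_theta(s, a) = theta^T features(s, a),
# updated with theta <- theta + eta * TD_t * grad_theta Q_theta(s_t, a_t).
# Assumed interfaces: env.reset() -> s; env.step(a) -> (s_next, r, done);
# features(s, a) -> np.ndarray of length dim; epsilon_greedy(q, s) -> action.
def sarsa(env, features, epsilon_greedy, dim, episodes=100, eta=0.01, rho=0.95):
    theta = np.zeros(dim)
    q = lambda s, a: theta @ features(s, a)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(q, s_next) if not done else None
            target = r + (0.0 if done else rho * q(s_next, a_next))
            td = target - q(s, a)
            theta += eta * td * features(s, a)   # gradient of a linear Q is features(s, a)
            s, a = s_next, a_next
    return theta
```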
Can learn our kernel 𝐋 𝑡 via SARSA in an end-
to-end manner
• $Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$
• $\mathbf{L}_t \equiv \mathbf{V}\,\mathbf{D}_t\,\mathbf{V}^\top$
• $\mathbf{D}_t \equiv \mathrm{Diag}\big(\exp(\mathbf{d}_t(\phi))\big)$
(C) Copyright IBM Corp. 2018 24
$$\nabla_\alpha Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = 1, \qquad \nabla_{\mathbf{V}(\bar{\mathbf{x}}_t)} Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \mathbf{0}$$
$$\nabla_{\mathbf{V}(\mathbf{x}_t)} Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = 2\,\big(\mathbf{V}(\mathbf{x}_t)^{+}\big)^\top, \qquad \nabla_{\phi} Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \mathrm{diag}\big(\mathbf{V}(\mathbf{x}_t)^{+}\,\mathbf{V}(\mathbf{x}_t)\big)\, \nabla_\phi \mathbf{d}_t(\phi)$$
$$\theta \leftarrow \theta + \eta\,\mathrm{TD}_t\, \nabla_\theta Q_\theta(s_t, a_t)$$
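As a sanity check of the end-to-end learning rule, the sketch below differentiates $Q_\theta$ with respect to the output $\mathbf{d}_t$ of the time-series model by direct calculus, $\partial Q_\theta / \partial d_m = e^{d_m}\,[\mathbf{V}(\mathbf{x}_t)^\top \mathbf{L}_t(\mathbf{x}_t)^{-1} \mathbf{V}(\mathbf{x}_t)]_{mm}$, and verifies it against finite differences; composing this with $\nabla_\phi \mathbf{d}_t(\phi)$ gives $\nabla_\phi Q_\theta$. This is an illustrative check with random $\mathbf{V}$ and $\mathbf{d}_t$, not the paper's implementation.

```python
import numpy as np

# Gradient check of dQ/dd_m for Q = alpha + log det(V(x) Diag(exp(d)) V(x)^T).
rng = np.random.default_rng(3)
N, K = 6, 4
V = rng.normal(size=(N, K))
d = rng.normal(size=K)
x = np.array([1, 0, 1, 1, 0, 0])
Vx = V[np.flatnonzero(x), :]                    # rows of V for selected actions

def q(d):
    Lx = Vx @ np.diag(np.exp(d)) @ Vx.T
    return np.linalg.slogdet(Lx)[1]             # alpha omitted; it drops out of the gradient

Lx0 = Vx @ np.diag(np.exp(d)) @ Vx.T
analytic = np.exp(d) * np.diag(Vx.T @ np.linalg.inv(Lx0) @ Vx)

eps = 1e-5
numeric = np.array([(q(d + eps * np.eye(K)[m]) - q(d - eps * np.eye(K)[m])) / (2 * eps)
                    for m in range(K)])
assert np.allclose(analytic, numeric, rtol=1e-4, atol=1e-6)
```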
Example: Blocker Task
• Reward
• +1 when an attacker reaches the end zone
• −1 each step
• We use the target positions of the agents as the features of the state-action pair
• $\mathbf{x}_t \equiv \psi(a_t) = s_{t+1} \in \{0, 1\}^{21}$
(C) Copyright IBM Corp. 2018 25
Determinantal SARSA finds a nearly optimal
policy 10 times faster than baseline methods
(C) Copyright IBM Corp. 2018 26
[Heese+ 2013]
SARSA: Free-energy SARSA
NN: Neural network
EQNAC: Energy-based Q-value natural actor critic
ENATDAC: Energy-based natural TD actor critic
Determinantal SARSA
Quality-similarity decomposition of the kernel
• Value, relevance, quality: $q_i = \sqrt{L_{i,i}}$
• Similarity: $S_{i,j} = \dfrac{L_{i,j}}{\sqrt{L_{i,i}\,L_{j,j}}}$
(C) Copyright IBM Corp. 2018 27
$$\mathbf{L} = \mathrm{Diag}(\mathbf{q})\,\mathbf{S}\,\mathrm{Diag}(\mathbf{q})$$
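A minimal sketch of this decomposition for an arbitrary positive semi-definite kernel (illustrative random kernel, not the learned one):

```python
import numpy as np

# Quality-similarity decomposition of a PSD kernel L:
# q_i = sqrt(L_ii) and S_ij = L_ij / (q_i q_j), so that L = Diag(q) S Diag(q).
rng = np.random.default_rng(4)
V = rng.normal(size=(5, 3))
L = V @ V.T

q = np.sqrt(np.diag(L))                  # quality (individual value) of each action
S = L / np.outer(q, q)                   # similarity matrix, with S_ii = 1
assert np.allclose(L, np.diag(q) @ S @ np.diag(q))
```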
Value of individual actions (next positions)
learned by Determinantal SARSA
(C) Copyright IBM Corp. 2018 28
(C) Copyright IBM Corp. 2018 29
Similarity between actions (next positions) learned by Determinantal SARSA
Recall our definition of action similarity: two actions are similar ⇔ low value when taken together; two actions are dissimilar ⇔ high value when taken together.
Stochastic Policy Task
• $2^N$ actions
• If the action matches the state, get +10 reward
• Hidden state transitions
(C) Copyright IBM Corp. 2018 30
(Example bit strings: 00101, 01011)
Determinantal SARSA finds nearly optimal policies,
while baselines suffer from partial observability
(C) Copyright IBM Corp. 2018 31
(𝑁 = 5) (𝑁 = 6) (𝑁 = 8)
SARSA: Free-energy SARSA
NN: Neural network
EQNAC: Energy-based Q-value natural actor critic
ENATDAC: Energy-based natural TD actor critic
Determinantal SARSA (shown in each panel)
Baselines from [Heese+ 2013]
Selecting "Diverse" Actions
(C) Copyright IBM Corp. 2018 32
Want to sample actions having high value
with high probability
• 𝜀-greedy
• Uniformly at random with probability 𝜀
• Best action ($a^\star = \arg\max_a Q(s, a)$) with probability $1 - \varepsilon$ (the argmax is intractable for a large action space)
• Boltzmann exploration
• Take action $a$ with probability $\propto \exp(\beta\, Q(s, a))$
• 𝛽: inverse temperature
(C) Copyright IBM Corp. 2018 33
Choosing a diverse team-action with
Boltzmann exploration
• Our value function:
𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
• Team-action selected according to Boltzmann exploration
$$\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\exp\big(\beta\, Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t)\big)}{\sum_{\tilde{\mathbf{x}}} \exp\big(\beta\, Q_\theta(\mathbf{z}_{\le t}, \tilde{\mathbf{x}})\big)} = \frac{\det\big(\mathbf{L}_t(\mathbf{x}_t)\big)^\beta}{\sum_{\tilde{\mathbf{x}}} \det\big(\mathbf{L}_t(\tilde{\mathbf{x}})\big)^\beta}$$
(C) Copyright IBM Corp. 2018 34
$\mathbf{x}_t \equiv \psi(a_t) \in \{0, 1\}^N$: binary features of team-action $a_t$
Boltzmann exploration can be performed
efficiently when 𝛽 = 1
$$\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\det \mathbf{L}_t(\mathbf{x}_t)}{\sum_{\tilde{\mathbf{x}}} \det \mathbf{L}_t(\tilde{\mathbf{x}})} = \frac{\det \mathbf{L}_t(\mathbf{x}_t)}{\det(\mathbf{L}_t + \mathbf{I})}$$
(C) Copyright IBM Corp. 2018 35
(The denominator, a sum over $2^N$ terms, reduces to the determinant of a single $N \times N$ matrix.)
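The identity behind this simplification, $\sum_{\mathbf{x}} \det \mathbf{L}_t(\mathbf{x}) = \det(\mathbf{L}_t + \mathbf{I})$, can be checked by brute force for small $N$ (illustrative sketch with a random kernel):

```python
import numpy as np
from itertools import product

# Check: the sum over all 2^N subsets of det L(x) equals det(L + I),
# with the determinant of the empty submatrix taken to be 1.
rng = np.random.default_rng(5)
N = 4
V = rng.normal(size=(N, N))
L = V @ V.T

def subdet(x):
    idx = np.flatnonzero(x)
    return 1.0 if idx.size == 0 else np.linalg.det(L[np.ix_(idx, idx)])

brute_force = sum(subdet(np.array(x)) for x in product([0, 1], repeat=N))
assert np.isclose(brute_force, np.linalg.det(L + np.eye(N)))
```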
Determinantal point process (DPP) has been
studied for modeling and learning diversity
• Probability of selecting a subset $\mathcal{X}$ [Macchi 1975]:
$$\mathbb{P}(\mathcal{X}) = \frac{\det \mathbf{L}_{\mathcal{X}}}{\sum_{\hat{\mathcal{X}}} \det \mathbf{L}_{\hat{\mathcal{X}}}} = \frac{\det \mathbf{L}_{\mathcal{X}}}{\det(\mathbf{L} + \mathbf{I})}$$
• Sampling in $O(N k^3)$ time [Hough+ 2006]
• 𝑁: size of ground set
• 𝑘: number of items in sampled subset
(C) Copyright IBM Corp. 2018 36
[Kulesza & Taskar (2012)]
Prior work has applied DPPs in
machine learning and artificial intelligence
• Recommendation [Gillenwater+ 2014, Gartrell+ 2017]
• Summarization of document, videos, … [Gong+ 2014]
• Hyper-parameter optimization [Kathuria+ 2016]
• Mini-batch sampling [Zhang+ 2017]
• Modeling neural spiking [Snoek+ 2013]
(C) Copyright IBM Corp. 2018 37
Nearly exact Boltzmann exploration with
MCMC [Kang 2013 (for 𝛽 = 1)]
1. Initialize 𝐱
2. Repeat
• $\mathbf{x} \leftarrow \mathbf{x}'$ with probability $\min\!\left(1,\ \left(\dfrac{\det \mathbf{L}_t(\mathbf{x}')}{\det \mathbf{L}_t(\mathbf{x})}\right)^{\!\beta}\right)$
• When $\mathbf{x}'$ and $\mathbf{x}$ differ by one bit, $\det \mathbf{L}_t(\mathbf{x}')$ can be computed from $\det \mathbf{L}_t(\mathbf{x})$ via rank-one update techniques
• Assume we can recover $a \leftarrow \psi^{-1}(\mathbf{x})$
(C) Copyright IBM Corp. 2018 38
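A minimal sketch of this MCMC scheme with single-bit-flip proposals; for clarity it recomputes determinants from scratch rather than using the rank-one updates mentioned above, and the kernel is a random illustrative one.

```python
import numpy as np

# Metropolis-style bit-flip sampler for Boltzmann exploration over team-actions:
# propose x' by flipping one bit of x and accept with prob min(1, (det L(x')/det L(x))^beta).
def subdet(L, x):
    idx = np.flatnonzero(x)
    return 1.0 if idx.size == 0 else np.linalg.det(L[np.ix_(idx, idx)])

def mcmc_team_action(L, beta=1.0, steps=1000, rng=None):
    rng = rng or np.random.default_rng()
    N = L.shape[0]
    x = rng.integers(0, 2, size=N)            # random initial team-action
    for _ in range(steps):
        x_new = x.copy()
        x_new[rng.integers(N)] ^= 1           # flip one randomly chosen bit
        ratio = (subdet(L, x_new) / max(subdet(L, x), 1e-300)) ** beta
        if rng.random() < min(1.0, ratio):
            x = x_new
    return x

rng = np.random.default_rng(7)
B = rng.normal(size=(6, 6))
L = B @ B.T
print(mcmc_team_action(L, beta=1.0, rng=rng))
```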
Simpler approaches:
Heuristics for more exploration and exploitation
• For more exploration
• Uniform with probability 𝜀
• DPP with probability 1 − 𝜀
• For more exploitation
• Sample 𝑀 actions according to DPP
• Choose the best among 𝑀
• Hard-core point processes [Matérn 1986]
(C) Copyright IBM Corp. 2018 39
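A sketch of the sample-M-and-choose-best heuristic above; `dpp_sampler` stands for any DPP sampler (e.g., the `sample_dpp` sketch earlier), and `log_det_q` is an illustrative stand-in for $Q_\theta$ up to the constant $\alpha$.

```python
import numpy as np

# "More exploitation": draw M team-actions from a DPP over the current kernel L
# and keep the one with the largest log det L(x).
def best_of_m(L, dpp_sampler, M=8, rng=None):
    rng = rng or np.random.default_rng()

    def log_det_q(x):
        idx = np.flatnonzero(x)
        if idx.size == 0:
            return -np.inf                     # never prefer the empty team-action
        return np.linalg.slogdet(L[np.ix_(idx, idx)])[1]

    candidates = []
    for _ in range(M):
        x = np.zeros(L.shape[0], dtype=int)
        x[dpp_sampler(L, rng)] = 1             # indices of the sampled actions
        candidates.append(x)
    return max(candidates, key=log_det_q)
```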
Summary
(C) Copyright IBM Corp. 2018 40
(C) Copyright IBM Corp. 2018 41
Need diversity in team-actions. Our definition of action similarity: two actions are similar ⇔ low value when taken together; dissimilar ⇔ high value when taken together.
Challenge: exponentially many team-actions.
Our solution: $Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$
Efficient learning:
$$\nabla_{\mathbf{V}(\mathbf{x}_t)} Q_\theta = 2\,\big(\mathbf{V}(\mathbf{x}_t)^{+}\big)^\top, \qquad \nabla_{\phi} Q_\theta = \mathrm{diag}\big(\mathbf{V}(\mathbf{x}_t)^{+}\,\mathbf{V}(\mathbf{x}_t)\big)\, \nabla_\phi \mathbf{d}_t(\phi)$$
Efficient sampling (cf. DPP):
$$\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\exp\big(\beta\, Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t)\big)}{\sum_{\tilde{\mathbf{x}}} \exp\big(\beta\, Q_\theta(\mathbf{z}_{\le t}, \tilde{\mathbf{x}})\big)} = \frac{\det\big(\mathbf{L}_t(\mathbf{x}_t)\big)^\beta}{\sum_{\tilde{\mathbf{x}}} \det\big(\mathbf{L}_t(\tilde{\mathbf{x}})\big)^\beta}$$
References
• T. Osogami and R. Raymond, Determinantal reinforcement learning,
AAAI-19
• T. Osogami and R. Raymond, Determinantal SARSA: Toward deep
reinforcement learning for collaborative agents, NIPS 2017 Deep
Reinforcement Learning Symposium
• T. Osogami, R. Raymond, A. Goel, T. Shirai, and T. Maehara, Dynamic
determinantal point processes, AAAI-18
• K. Zhao, T. Morimura, and T. Osogami, Event-based visual analytics for
team-based invasion sports, PacificVis 2018
(C) Copyright IBM Corp. 2018 42
Presentation material for the 19th STAIR Lab Artificial Intelligence Seminar.