Diversity-Aware Reinforcement Learning and Machine Learning
Takayuki Osogami
IBM Research – Tokyo
(C) Copyright IBM Corp. 2018 1
This work was supported by JST CREST Grant Number JPMJCR1304, Japan.
Takayuki Osogami
• Senior Research Staff Member, IBM Research – Tokyo
• 1998: Graduated from the Department of Electronic Engineering, the University of Tokyo
• 2005: Ph.D., Computer Science Department, Carnegie Mellon University
• Research interests: probabilistic models, sequential decision making
(C) Copyright IBM Corp. 2018 2
Diversity
(C) Copyright IBM Corp. 2018 3
Want to see relevant and diverse items in
recommendation
(C) Copyright IBM Corp. 2018 4
“trunk” by Bing (little diversity); “trunk” by Google (some diversity)
Can have more diversity
Want to take relevant and diverse actions in
team sports
Zone defense Man-to-man defense
(C) Copyright IBM Corp. 2018 5
basketballhq.com upload.wikimedia.org
Diversity also matters in soccer
(C) Copyright IBM Corp. 2018 6
RoboCupSoccer - Simulation J League (Based on data from DataStadium)
Diversity also matters in multi-agent games
(C) Copyright IBM Corp. 2018 7
Accounting for Diversity in Reinforcement Learning
(C) Copyright IBM Corp. 2018 8
Reinforcement learning seeks to find an
optimal policy for sequential decision making
(C) Copyright IBM Corp. 2018 9
[Diagram: the agent sends an action to the environment and receives an observation and a reward]
Goal: Maximize cumulative reward
One approach to reinforcement learning is to
learn the action-value function
• Action-value function: $Q^\pi$
• $Q^\pi(s, a)$: expected cumulative reward from state $s$ when taking action $a$ and then following (optimal) policy $\pi$
• Assume (relaxed later): the Markovian state $s$ is fully observable
(C) Copyright IBM Corp. 2018 10
[Diagram: starting from state $s$, take action $a$, then collect rewards over subsequent steps; $Q^\pi(s, a)$ is their expected sum]
For most practical tasks, the action-value
function needs to be approximated
Exponentially large state space
• Combination of multiple factors
• e.g. Factor: “state” of each position
• History of observations
Exponentially large action space
• Combination of multiple “levers”
• Combination of multiple agents
(C) Copyright IBM Corp. 2018 11
Cannot deal with 𝑄(𝑠, 𝑎) in tabular form
Action-value functions approximated with
(deep) neural networks
(C) Copyright IBM Corp. 2018 12
[Diagram: a neural network takes a distributed representation of state $s$ as input (e.g., one unit per pixel) and outputs $Q^\pi(s, a_1), Q^\pi(s, a_2), \dots$; an exponentially large action space would require exponentially many output units]
Challenges in collaborative multi-agent
reinforcement learning
• Exponentially many combinations of actions (team-actions)
(C) Copyright IBM Corp. 2018 13
1. How to efficiently evaluate the value of team-actions
2. How to efficiently sample good team-actions
Consider the diversity of actions, in addition
to the value (relevance) of each action
(C) Copyright IBM Corp. 2018 14
[Diagram: the selected actions as vectors spanning a parallelepiped; diversity corresponds to its volume]
Value of the team-action of 3 agents: $\mathrm{Volume}^2 = Q(s, \{a_1, a_3, a_5\})$
Diversity can be represented by determinant
(C) Copyright IBM Corp. 2018 15
Diversity as squared volume:
$$\mathrm{Volume}^2 = Q(s, \{a_1, a_3, a_5\}) = \det\!\big(\mathbf{V}(\mathbf{x})\,\mathbf{V}(\mathbf{x})^\top\big)$$
Each row of $\mathbf{V}$ is the feature vector of one action:
$$\mathbf{V} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ \vdots \end{pmatrix} = \begin{pmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \\ v_{31} & v_{32} & v_{33} \\ v_{41} & v_{42} & v_{43} \\ v_{51} & v_{52} & v_{53} \\ \vdots & \vdots & \vdots \end{pmatrix}, \qquad \mathbf{L} = \mathbf{V}\,\mathbf{V}^\top$$
The binary vector $\mathbf{x} = (1, 0, 1, 0, 1, \dots)$ indicates the selected actions, and $\mathbf{L}(\mathbf{x}) = \mathbf{V}(\mathbf{x})\,\mathbf{V}(\mathbf{x})^\top$, where $\mathbf{V}(\mathbf{x})$ keeps only the rows with $x_i = 1$.
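As a quick illustration of "value = squared volume", here is a minimal NumPy sketch with made-up action feature vectors (not the talk's data): it computes $\det(\mathbf{V}(\mathbf{x})\,\mathbf{V}(\mathbf{x})^\top)$ for a diverse team-action and for a redundant one.

```python
import numpy as np

# Minimal sketch (NumPy, hypothetical feature vectors): the value of a
# team-action is the squared volume spanned by the selected action vectors,
# i.e., det(V(x) V(x)^T), which shrinks as the selected actions become similar.
V = np.array([
    [1.0, 0.0, 0.0],   # a1
    [0.9, 0.1, 0.0],   # a2: nearly parallel to a1
    [0.0, 1.0, 0.0],   # a3
    [0.0, 0.0, 1.0],   # a4
    [0.5, 0.5, 0.0],   # a5
])
L = V @ V.T            # kernel over all candidate actions

def team_value(x):
    """det of the principal submatrix of L for the selected actions x."""
    idx = np.flatnonzero(x)
    return np.linalg.det(L[np.ix_(idx, idx)])

diverse = np.array([1, 0, 1, 1, 0])   # a1, a3, a4: mutually orthogonal
similar = np.array([1, 1, 0, 1, 0])   # a1, a2, a4: a1 and a2 nearly parallel
print(team_value(diverse))  # = 1.0 (large volume)
print(team_value(similar))  # = 0.01 (a1 and a2 are nearly redundant)
```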
Our definition of diversity (similarity) in
multi-agent reinforcement learning
• The value of team-action is represented by determinant (volume)
(C) Copyright IBM Corp. 2018 16
We will learn the similarity between actions accordingly
• Two actions are similar ⇔ the value is low when the two actions are taken together
• Two actions are dissimilar ⇔ the value is high when the two actions are taken together
We represent the action-value function with
determinant
• $\mathbf{z}_{\le t} \equiv (\mathbf{z}_0, \mathbf{z}_1, \dots, \mathbf{z}_t)$: time series of observations
• $\mathbf{z}_{\le t} = s_t$ if the Markovian state $s_t$ is observable
• $\mathbf{x}_t \equiv \psi(a_t) \in \{0, 1\}^N$: binary features of team-action $a_t$
  • e.g., $\mathbf{x}_t$ indicates which actions are selected by the team
• $\mathbf{L}_t$: positive semi-definite matrix (kernel) that can depend on $\mathbf{z}_{\le t}$
(C) Copyright IBM Corp. 2018 17
$$Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$$
where $\mathbf{L}_t(\mathbf{x}_t)$ is the principal submatrix of $\mathbf{L}_t$ restricted to the rows and columns $i$ with $x_{t,i} = 1$.
Particular structure of the kernel for effective
learning
• $\mathbf{L}_t \equiv \mathbf{V}\,\mathbf{D}_t\,\mathbf{V}^\top$
• $\mathbf{V}$: $N \times K$ matrix ($K \le N$)
• $\mathbf{D}_t \equiv \mathrm{Diag}\big(\exp(\mathbf{d}_t(\phi))\big)$
• $\mathbf{d}_t(\phi)$: differentiable time-series model with parameter $\phi$ (e.g., RNN, LSTM, DyBM, VAR)
(C) Copyright IBM Corp. 2018 18
$$Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$$
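To make the construction concrete, here is a minimal NumPy sketch that evaluates $Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$ for the structured kernel; the sizes, the random $\mathbf{V}$, and the stand-in $\mathbf{d}_t$ are illustrative assumptions, not the talk's implementation.

```python
import numpy as np

# Minimal sketch: Q_theta(z_<=t, x_t) = alpha + log det L_t(x_t) with the
# structured kernel L_t = V Diag(exp(d_t(phi))) V^T.  Here d_t is a stand-in
# for the output of a differentiable time-series model (RNN, LSTM, DyBM, VAR, ...).
rng = np.random.default_rng(0)
N, K = 6, 4                      # N candidate actions, rank-K factorization
alpha = 0.0
V = rng.normal(size=(N, K))      # learnable N x K factor
d_t = rng.normal(size=K)         # would come from the time-series model

def q_value(x, V, d_t, alpha=0.0):
    """alpha + log det of the selected principal submatrix of V Diag(exp(d)) V^T."""
    L = V @ np.diag(np.exp(d_t)) @ V.T
    idx = np.flatnonzero(x)
    sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return alpha + logdet        # sign is +1 whenever the submatrix is nonsingular

x_t = np.array([1, 0, 1, 1, 0, 0])   # a team-action selecting 3 of the N actions
print(q_value(x_t, V, d_t, alpha))
```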
Special case of diagonal kernel reduces to the
standard approach of ignoring diversity
(C) Copyright IBM Corp. 2018 19
• $\mathbf{L}_t = \mathbf{D}_t$ (let $\mathbf{V} = \mathbf{I}$)
• $\mathbf{D}_t \equiv \mathrm{Diag}\big(\exp(\mathbf{d}_t(\phi))\big)$
• $\mathbf{d}_t(\phi)$: differentiable time-series model
$$Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t) = \alpha + \mathbf{d}_t(\phi)^\top \mathbf{x}_t = \alpha + \sum_{i:\, x_{t,i}=1} d_t(\phi)_i$$
This is the sum of the values of the selected actions, where $d_t(\phi)_i$ is the value of action $a_i$ at time $t$.
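A one-line numerical check of this special case (illustrative values only):

```python
import numpy as np

# With a diagonal kernel L_t = Diag(exp(d_t)), log det L_t(x_t) is just the sum
# of d_t over the selected actions, i.e. the standard "sum of individual action
# values" with no diversity term.
rng = np.random.default_rng(1)
d_t = rng.normal(size=5)
x_t = np.array([1, 0, 1, 0, 1])

L = np.diag(np.exp(d_t))
idx = np.flatnonzero(x_t)
logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
assert np.isclose(logdet, d_t @ x_t)   # log det equals the sum of selected d_t[i]
```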
Determinantal layer for diversity
(C) Copyright IBM Corp. 2018 20
[Diagram: a neural network maps state $s$ to $\mathbf{d}_t(\phi)$, followed by a determinantal layer]
$$Q_\theta(s, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t), \qquad \mathbf{L}_t \equiv \mathbf{V}\,\mathrm{Diag}\big(\exp(\mathbf{d}_t(\phi))\big)\,\mathbf{V}^\top$$
Prior work uses free-energy of restricted
Boltzmann machines [Sallans & Hinton 2001]
(C) Copyright IBM Corp. 2018 21
[Diagram: a restricted Boltzmann machine over state $s$ and team-action $\mathbf{x}$, with hidden units $\mathbf{h}$]
$$F(\mathbf{x}) = -\log \sum_{\tilde{\mathbf{h}}} \exp\big(-E(\mathbf{x}, \tilde{\mathbf{h}})\big), \qquad E(\mathbf{x}, \mathbf{h}) \equiv -\mathbf{b}^\top \mathbf{x} - \mathbf{c}^\top \mathbf{h} - \mathbf{x}^\top \mathbf{W} \mathbf{h}$$
With no hidden units $\mathbf{h}$, this reduces to $F(\mathbf{x}) = -\mathbf{b}^\top \mathbf{x}$, where the bias $\mathbf{b}$ represents the individual value of each action.
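For comparison, a minimal sketch of this free energy using the standard closed form for binary hidden units; the sizes and parameters are made up, and this is only an illustration of the quantity, not the setup of [Sallans & Hinton 2001].

```python
import numpy as np

# Free energy of an RBM over binary team-actions x, via the standard closed form
#   F(x) = -b^T x - sum_j log(1 + exp(c_j + (W^T x)_j)),
# which equals -log sum_h exp(-E(x, h)) for E(x, h) = -b^T x - c^T h - x^T W h.
rng = np.random.default_rng(2)
N, H = 5, 3                      # visible (action) units, hidden units
b = rng.normal(size=N)           # individual value of each action
c = rng.normal(size=H)
W = rng.normal(size=(N, H))

def free_energy(x):
    return -b @ x - np.sum(np.logaddexp(0.0, c + W.T @ x))

x = np.array([1, 0, 1, 1, 0])
print(free_energy(x))            # the free-energy approach uses -F(x) as the value estimate
```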
Learning "Diversity" via Reinforcement Learning
(C) Copyright IBM Corp. 2018 22
Reinforcement learning with SARSA
• Tabular case
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\,\mathrm{TD}_t, \quad \text{where } \mathrm{TD}_t \equiv r_{t+1} + \rho\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
• With function approximation $Q_\theta(s, a) \approx Q(s, a)$:
$$\theta \leftarrow \theta + \eta\,\mathrm{TD}_t\, \nabla_\theta Q_\theta(s_t, a_t)$$
(C) Copyright IBM Corp. 2018 23
(Both $Q(s_t, a_t)$ and $r_{t+1} + \rho\,Q(s_{t+1}, a_{t+1})$ estimate the cumulative reward from $t$.)
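For reference, a generic sketch of SARSA with linear function approximation; `env`, `features`, and `epsilon_greedy` are assumed helpers with the interfaces shown in the comments, and this is not the determinantal model itself.

```python
import numpy as np

# Minimal SARSA sketch with a linear Q_theta(s, a) = theta^T features(s, a),
# updated with theta <- theta + eta * TD_t * grad_theta Q_theta(s_t, a_t).
# Assumed interfaces: env.reset() -> s; env.step(a) -> (s_next, r, done);
# features(s, a) -> np.ndarray of length dim; epsilon_greedy(q, s) -> action.
def sarsa(env, features, epsilon_greedy, dim, episodes=100, eta=0.01, rho=0.95):
    theta = np.zeros(dim)
    q = lambda s, a: theta @ features(s, a)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(q, s_next) if not done else None
            target = r + (0.0 if done else rho * q(s_next, a_next))
            td = target - q(s, a)
            theta += eta * td * features(s, a)   # gradient of a linear Q is features(s, a)
            s, a = s_next, a_next
    return theta
```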
Can learn our kernel 𝐋 𝑡 via SARSA in an end-
to-end manner
• $Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$
• $\mathbf{L}_t \equiv \mathbf{V}\,\mathbf{D}_t\,\mathbf{V}^\top$
• $\mathbf{D}_t \equiv \mathrm{Diag}\big(\exp(\mathbf{d}_t(\phi))\big)$
(C) Copyright IBM Corp. 2018 24
$$\nabla_\alpha Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = 1, \qquad \nabla_{\mathbf{V}(\bar{\mathbf{x}}_t)} Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \mathbf{0}$$
$$\nabla_{\mathbf{V}(\mathbf{x}_t)} Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = 2\,\big(\mathbf{V}(\mathbf{x}_t)^{+}\big)^\top, \qquad \nabla_{\phi} Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \mathrm{diag}\big(\mathbf{V}(\mathbf{x}_t)^{+}\,\mathbf{V}(\mathbf{x}_t)\big)\, \nabla_\phi \mathbf{d}_t(\phi)$$
$$\theta \leftarrow \theta + \eta\,\mathrm{TD}_t\, \nabla_\theta Q_\theta(s_t, a_t)$$
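As a sanity check of the end-to-end learning rule, the sketch below differentiates $Q_\theta$ with respect to the output $\mathbf{d}_t$ of the time-series model by direct calculus, $\partial Q_\theta / \partial d_m = e^{d_m}\,[\mathbf{V}(\mathbf{x}_t)^\top \mathbf{L}_t(\mathbf{x}_t)^{-1} \mathbf{V}(\mathbf{x}_t)]_{mm}$, and verifies it against finite differences; composing this with $\nabla_\phi \mathbf{d}_t(\phi)$ gives $\nabla_\phi Q_\theta$. This is an illustrative check with random $\mathbf{V}$ and $\mathbf{d}_t$, not the paper's implementation.

```python
import numpy as np

# Gradient check of dQ/dd_m for Q = alpha + log det(V(x) Diag(exp(d)) V(x)^T).
rng = np.random.default_rng(3)
N, K = 6, 4
V = rng.normal(size=(N, K))
d = rng.normal(size=K)
x = np.array([1, 0, 1, 1, 0, 0])
Vx = V[np.flatnonzero(x), :]                    # rows of V for selected actions

def q(d):
    Lx = Vx @ np.diag(np.exp(d)) @ Vx.T
    return np.linalg.slogdet(Lx)[1]             # alpha omitted; it drops out of the gradient

Lx0 = Vx @ np.diag(np.exp(d)) @ Vx.T
analytic = np.exp(d) * np.diag(Vx.T @ np.linalg.inv(Lx0) @ Vx)

eps = 1e-5
numeric = np.array([(q(d + eps * np.eye(K)[m]) - q(d - eps * np.eye(K)[m])) / (2 * eps)
                    for m in range(K)])
assert np.allclose(analytic, numeric, rtol=1e-4, atol=1e-6)
```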
Example: Blocker Task
• Reward
• +1 when an attacker reaches the end zone
• −1 each step
• We use the target positions of the agents as the features of the state-action pair
• $\mathbf{x}_t \equiv \psi(a_t) = s_{t+1} \in \{0, 1\}^{21}$
(C) Copyright IBM Corp. 2018 25
Determinantal SARSA finds a nearly optimal
policy 10 times faster than baseline methods
(C) Copyright IBM Corp. 2018 26
[Heese+ 2013]
SARSA: Free-energy SARSA
NN: Neural network
EQNAC: Energy-based Q-value natural actor critic
ENATDAC: Energy-based natural TD actor critic
Determinantal SARSA
Quality-similarity decomposition of the kernel
• Value, relevance, quality: $q_i = \sqrt{L_{i,i}}$
• Similarity: $S_{i,j} = \dfrac{L_{i,j}}{\sqrt{L_{i,i}\,L_{j,j}}}$
(C) Copyright IBM Corp. 2018 27
$$\mathbf{L} = \mathrm{Diag}(\mathbf{q})\,\mathbf{S}\,\mathrm{Diag}(\mathbf{q})$$
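A minimal sketch of this decomposition for an arbitrary positive semi-definite kernel (illustrative random kernel, not the learned one):

```python
import numpy as np

# Quality-similarity decomposition of a PSD kernel L:
# q_i = sqrt(L_ii) and S_ij = L_ij / (q_i q_j), so that L = Diag(q) S Diag(q).
rng = np.random.default_rng(4)
V = rng.normal(size=(5, 3))
L = V @ V.T

q = np.sqrt(np.diag(L))                  # quality (individual value) of each action
S = L / np.outer(q, q)                   # similarity matrix, with S_ii = 1
assert np.allclose(L, np.diag(q) @ S @ np.diag(q))
```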
Value of individual actions (next positions)
learned by Determinantal SARSA
(C) Copyright IBM Corp. 2018 28
(C) Copyright IBM Corp. 2018 29
Similarity between actions (next positions) learned by Determinantal SARSA
Recall our definition of action similarity: two actions are similar ⇔ low value when taken together; two actions are dissimilar ⇔ high value when taken together.
Stochastic Policy Task
• $2^N$ actions
• If the action matches the state, get +10 reward
• Hidden state transitions
(C) Copyright IBM Corp. 2018 30
(Example bit strings: 00101, 01011)
Determinantal SARSA finds nearly optimal policies,
while baselines suffer from partial observability
(C) Copyright IBM Corp. 2018 31
(𝑁 = 5) (𝑁 = 6) (𝑁 = 8)
SARSA: Free-energy SARSA
NN: Neural network
EQNAC: Energy-based Q-value natural actor critic
ENATDAC: Energy-based natural TD actor critic
Determinantal SARSA (shown in each panel)
Baselines from [Heese+ 2013]
Selecting "Diverse" Actions
(C) Copyright IBM Corp. 2018 32
Want to sample actions having high value
with high probability
• 𝜀-greedy
• Uniformly at random with probability 𝜀
• Best action ($a^\star = \arg\max_a Q(s, a)$) with probability $1 - \varepsilon$ (the argmax is intractable for a large action space)
• Boltzmann exploration
• Take action $a$ with probability $\propto \exp(\beta\, Q(s, a))$
• 𝛽: inverse temperature
(C) Copyright IBM Corp. 2018 33
Choosing a diverse team-action with
Boltzmann exploration
• Our value function:
𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
• Team-action selected according to Boltzmann exploration
$$\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\exp\big(\beta\, Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t)\big)}{\sum_{\tilde{\mathbf{x}}} \exp\big(\beta\, Q_\theta(\mathbf{z}_{\le t}, \tilde{\mathbf{x}})\big)} = \frac{\det\big(\mathbf{L}_t(\mathbf{x}_t)\big)^\beta}{\sum_{\tilde{\mathbf{x}}} \det\big(\mathbf{L}_t(\tilde{\mathbf{x}})\big)^\beta}$$
(C) Copyright IBM Corp. 2018 34
$\mathbf{x}_t \equiv \psi(a_t) \in \{0, 1\}^N$: binary features of team-action $a_t$
Boltzmann exploration can be performed
efficiently when 𝛽 = 1
$$\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\det \mathbf{L}_t(\mathbf{x}_t)}{\sum_{\tilde{\mathbf{x}}} \det \mathbf{L}_t(\tilde{\mathbf{x}})} = \frac{\det \mathbf{L}_t(\mathbf{x}_t)}{\det(\mathbf{L}_t + \mathbf{I})}$$
(C) Copyright IBM Corp. 2018 35
(The denominator, a sum over $2^N$ terms, reduces to the determinant of a single $N \times N$ matrix.)
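The identity behind this simplification, $\sum_{\mathbf{x}} \det \mathbf{L}_t(\mathbf{x}) = \det(\mathbf{L}_t + \mathbf{I})$, can be checked by brute force for small $N$ (illustrative sketch with a random kernel):

```python
import numpy as np
from itertools import product

# Check: the sum over all 2^N subsets of det L(x) equals det(L + I),
# with the determinant of the empty submatrix taken to be 1.
rng = np.random.default_rng(5)
N = 4
V = rng.normal(size=(N, N))
L = V @ V.T

def subdet(x):
    idx = np.flatnonzero(x)
    return 1.0 if idx.size == 0 else np.linalg.det(L[np.ix_(idx, idx)])

brute_force = sum(subdet(np.array(x)) for x in product([0, 1], repeat=N))
assert np.isclose(brute_force, np.linalg.det(L + np.eye(N)))
```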
Determinantal point process (DPP) has been
studied for modeling and learning diversity
• Probability of selecting a subset $\mathcal{X}$ [Macchi 1975]:
$$\mathbb{P}(\mathcal{X}) = \frac{\det \mathbf{L}_{\mathcal{X}}}{\sum_{\hat{\mathcal{X}}} \det \mathbf{L}_{\hat{\mathcal{X}}}} = \frac{\det \mathbf{L}_{\mathcal{X}}}{\det(\mathbf{L} + \mathbf{I})}$$
• Sampling in $O(N k^3)$ time [Hough+ 2006]
• 𝑁: size of ground set
• 𝑘: number of items in sampled subset
(C) Copyright IBM Corp. 2018 36
[Kulesza & Taskar (2012)]
Prior work has applied DPPs in
machine learning and artificial intelligence
• Recommendation [Gillenwater+ 2014, Gartrell+ 2017]
• Summarization of document, videos, … [Gong+ 2014]
• Hyper-parameter optimization [Kathuria+ 2016]
• Mini-batch sampling [Zhang+ 2017]
• Modeling neural spiking [Snoek+ 2013]
(C) Copyright IBM Corp. 2018 37
Nearly exact Boltzmann exploration with
MCMC [Kang 2013 (for 𝛽 = 1)]
1. Initialize 𝐱
2. Repeat
• $\mathbf{x} \leftarrow \mathbf{x}'$ with probability $\min\!\left(1,\ \left(\dfrac{\det \mathbf{L}_t(\mathbf{x}')}{\det \mathbf{L}_t(\mathbf{x})}\right)^{\!\beta}\right)$
• When $\mathbf{x}'$ and $\mathbf{x}$ differ by one bit, $\det \mathbf{L}_t(\mathbf{x}')$ can be computed from $\det \mathbf{L}_t(\mathbf{x})$ via rank-one update techniques
• Assume we can recover $a \leftarrow \psi^{-1}(\mathbf{x})$
(C) Copyright IBM Corp. 2018 38
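A minimal sketch of this MCMC scheme with single-bit-flip proposals; for clarity it recomputes determinants from scratch rather than using the rank-one updates mentioned above, and the kernel is a random illustrative one.

```python
import numpy as np

# Metropolis-style bit-flip sampler for Boltzmann exploration over team-actions:
# propose x' by flipping one bit of x and accept with prob min(1, (det L(x')/det L(x))^beta).
def subdet(L, x):
    idx = np.flatnonzero(x)
    return 1.0 if idx.size == 0 else np.linalg.det(L[np.ix_(idx, idx)])

def mcmc_team_action(L, beta=1.0, steps=1000, rng=None):
    rng = rng or np.random.default_rng()
    N = L.shape[0]
    x = rng.integers(0, 2, size=N)            # random initial team-action
    for _ in range(steps):
        x_new = x.copy()
        x_new[rng.integers(N)] ^= 1           # flip one randomly chosen bit
        ratio = (subdet(L, x_new) / max(subdet(L, x), 1e-300)) ** beta
        if rng.random() < min(1.0, ratio):
            x = x_new
    return x

rng = np.random.default_rng(7)
B = rng.normal(size=(6, 6))
L = B @ B.T
print(mcmc_team_action(L, beta=1.0, rng=rng))
```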
Simpler approaches:
Heuristics for more exploration and exploitation
• For more exploration
• Uniform with probability 𝜀
• DPP with probability 1 − 𝜀
• For more exploitation
• Sample 𝑀 actions according to DPP
• Choose the best among 𝑀
• Hard-core point processes [Matérn 1986]
(C) Copyright IBM Corp. 2018 39
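A sketch of the sample-M-and-choose-best heuristic above; `dpp_sampler` stands for any DPP sampler (e.g., the `sample_dpp` sketch earlier), and `log_det_q` is an illustrative stand-in for $Q_\theta$ up to the constant $\alpha$.

```python
import numpy as np

# "More exploitation": draw M team-actions from a DPP over the current kernel L
# and keep the one with the largest log det L(x).
def best_of_m(L, dpp_sampler, M=8, rng=None):
    rng = rng or np.random.default_rng()

    def log_det_q(x):
        idx = np.flatnonzero(x)
        if idx.size == 0:
            return -np.inf                     # never prefer the empty team-action
        return np.linalg.slogdet(L[np.ix_(idx, idx)])[1]

    candidates = []
    for _ in range(M):
        x = np.zeros(L.shape[0], dtype=int)
        x[dpp_sampler(L, rng)] = 1             # indices of the sampled actions
        candidates.append(x)
    return max(candidates, key=log_det_q)
```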
Summary
(C) Copyright IBM Corp. 2018 40
(C) Copyright IBM Corp. 2018 41
Need diversity in team-actions. Our definition of action similarity: two actions are similar ⇔ low value when taken together; dissimilar ⇔ high value when taken together.
Challenge: exponentially many team-actions.
Our solution: $Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)$
Efficient learning:
$$\nabla_{\mathbf{V}(\mathbf{x}_t)} Q_\theta = 2\,\big(\mathbf{V}(\mathbf{x}_t)^{+}\big)^\top, \qquad \nabla_{\phi} Q_\theta = \mathrm{diag}\big(\mathbf{V}(\mathbf{x}_t)^{+}\,\mathbf{V}(\mathbf{x}_t)\big)\, \nabla_\phi \mathbf{d}_t(\phi)$$
Efficient sampling (cf. DPP):
$$\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\exp\big(\beta\, Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t)\big)}{\sum_{\tilde{\mathbf{x}}} \exp\big(\beta\, Q_\theta(\mathbf{z}_{\le t}, \tilde{\mathbf{x}})\big)} = \frac{\det\big(\mathbf{L}_t(\mathbf{x}_t)\big)^\beta}{\sum_{\tilde{\mathbf{x}}} \det\big(\mathbf{L}_t(\tilde{\mathbf{x}})\big)^\beta}$$
References
• T. Osogami and R. Raymond, Determinantal reinforcement learning,
AAAI-19
• T. Osogami and R. Raymond, Determinantal SARSA: Toward deep
reinforcement learning for collaborative agents, NIPS 2017 Deep
Reinforcement Learning Symposium
• T. Osogami, R. Raymond, A. Goel, T. Shirai, and T. Maehara, Dynamic
determinantal point processes, AAAI-18
• K. Zhao, T. Morimura, and T. Osogami, Event-based visual analytics for
team-based invasion sports, PacificVis 2018
(C) Copyright IBM Corp. 2018 42
Presentation material for the 19th STAIR Lab Artificial Intelligence Seminar.