SlideShare a Scribd company logo
1 of 42
Download to read offline
多様性を考慮した強化学習・機械学習
恐神 貴行
IBM Research – Tokyo
(C) Copyright IBM Corp. 2018 1
This work was supported by JST CREST Grant Number JPMJCR1304, Japan.
恐神貴行
• IBM東京基礎研究所 シニア・リサーチ・スタッフ・メンバー
• 1998年 東京大学工学部電子工学科卒
• 2005年 カーネギーメロン大学計算機科学科博士号取得
• 研究分野: 確率モデル、逐次的意思決定
(C) Copyright IBM Corp. 2018 2
多様性
(C) Copyright IBM Corp. 2018 3
Want to see relevant and diverse items in
recommendation
(C) Copyright IBM Corp. 2018 4
“trunk” by Bing (little diversity)“trunk” by Google (some diversity)
Can have more diversity
Want to take relevant and diverse actions in
team sports
Zone defense Man-to-man defense
(C) Copyright IBM Corp. 2018 5
basketballhq.com upload.wikimedia.org
Diversity also matters in soccer
(C) Copyright IBM Corp. 2018 6
RoboCupSoccer - Simulation J League (Based on data from DataStadium)
Diversity also matters in Multi-agent games
(C) Copyright IBM Corp. 2018 7
強化学習における多様性の考慮
(C) Copyright IBM Corp. 2018 8
Reinforcement learning seeks to find an
optimal policy for sequential decision making
(C) Copyright IBM Corp. 2018 9
Agent
Action
Environment
Observation, Reward
Goal: Maximize cumulative reward
An approach of reinforcement learning is to
learn the action-value function
• Action-value function: 𝑄 𝜋
• 𝑄 𝜋(𝑠, 𝑎): Expected cumulative reward from state 𝑠 by taking action 𝑎 and then
following (optimal) policy 𝜋
• Assume (relaxed later): Markovian state 𝑠 is fully observable
(C) Copyright IBM Corp. 2018 10
𝑠
𝑎
…
reward …reward reward
𝑄 𝜋(𝑠, 𝑎)
For most practical tasks, the action-value
function needs to be approximated
Exponentially large state space
• Combination of multiple factors
• e.g. Factor: “state” of each position
• History of observations
Exponentially large action space
• Combination of multiple “levers”
• Combination of multiple agents
(C) Copyright IBM Corp. 2018 11
Cannot deal with 𝑄(𝑠, 𝑎) in tabular form
Action-value functions approximated with
(deep) neural networks
(C) Copyright IBM Corp. 2018 12
state 𝑠
Distributed representation
(e.g. each unit for each pixel)
𝑄 𝜋
(𝑠, 𝑎1)
𝑄 𝜋(𝑠, 𝑎2)
Exponentially many
units for exponentially
large action space
Challenges in collaborative multi-agent
reinforcement learning
• Exponentially many combinations of actions (team-actions)
(C) Copyright IBM Corp. 2018 13
1. How to efficiently evaluate the value of team-actions
2. How to efficiently sample good team-actions
Consider the diversity of actions, in addition
to the value (relevance) of each action
(C) Copyright IBM Corp. 2018 14
Diversity
Value of team-action of 3 agents:
Volume2 = 𝑄(𝑠, 𝑎1, 𝑎3, 𝑎5 )
Diversity can be represented by determinant
(C) Copyright IBM Corp. 2018 15
Diversity
Volume2
= 𝑄 𝑠, 𝑎1, 𝑎3, 𝑎5
= det 𝐕(𝐱) 𝐕(𝐱)⊤
𝐕 =
𝑎1
𝑎2
𝑎3
𝑎4
𝑎5
:
=
𝑣11 𝑣12 𝑣13
𝑣21 𝑣22 𝑣23
𝑣31 𝑣32 𝑣33
𝑣41 𝑣42 𝑣43
𝑣51 𝑣52 𝑣53
:
𝐋 = 𝐕 𝐕⊤
𝐱 = 1, 0, 1, 0, 1, …
𝐋 𝐱 = 𝐕 𝐱 𝐕 𝐱 ⊤
Indicate
selected
actions
Our definition of diversity (similarity) in
multi-agent reinforcement learning
• The value of team-action is represented by determinant (volume)
(C) Copyright IBM Corp. 2018 16
We will learn the similarity between actions accordingly
Two actions are similar
Two actions are dissimilar
Value is low when the two actions are taken together
Value is high when the two actions are taken together
We represent the action-value function with
determinant
• 𝐳≤𝑡 ≡ (𝐳0, 𝐳1, … , 𝐳 𝑡): Time−series of observations
• 𝐳≤𝑡 = 𝑠𝑡 if Markovian state 𝑠𝑡 is observable
• 𝐱 𝑡 ≡ 𝜓 𝑎 𝑡 ∈ 0, 1 𝑁: Binary features of team-action 𝑎 𝑡
• e.g. 𝐱 𝑡 indicates which actions are selected by the team
• 𝐋 𝑡: Positive semi-definite matrix (kernel) that can depend on 𝐳≤𝑡
(C) Copyright IBM Corp. 2018 17
𝐋 𝑡(𝐱t)𝑥𝑡 𝑖 = 1
𝐋 𝑡
𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
Particular structure of the kernel for effective
learning
• 𝐋 𝑡 ≡ 𝐕 𝐃 𝑡 𝐕⊤
• 𝐕: 𝑁 × 𝐾 matrix (𝐾 ≤ 𝑁)
• 𝐃 𝑡 ≡ Diagonal exp 𝐝 𝑡 𝜙
• 𝐝 𝑡 𝜙 : differentiable time−series model with parameter 𝜙
(e.g. RNN, LSTM, DyBM, VAR)
(C) Copyright IBM Corp. 2018 18
𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
Special case of diagonal kernel reduces to the
standard approach of ignoring diversity
(C) Copyright IBM Corp. 2018 19
𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
= 𝛼 + 𝐝 𝑡 𝜙 ⊤
𝐱 𝑡
= ෍
𝑖: 𝑥 𝑡 𝑖=1
𝑑 𝑡 𝜙 𝑖
• 𝐋 𝑡 = 𝐃 𝑡 (let 𝐕 = I)
• 𝐃 𝑡 ≡ Diag exp 𝐝 𝑡 𝜙
• 𝐝 𝑡 𝜙 : differentiable time−series model
Sum of the values of selected actions
𝑑 𝑡 𝜙 𝑖: value of 𝑎𝑖 at time 𝑡
𝑥𝑡 𝑖 = 1
𝐋 𝑡
Determinantal layer for diversity
(C) Copyright IBM Corp. 2018 20
state 𝑠
𝐝 𝑡 𝜙
𝑄 𝜃 𝑠, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
𝐋 𝑡 ≡ 𝐕 Diag exp 𝐝 𝑡 𝜙 𝐕⊤
Prior work uses free-energy of restricted
Boltzmann machines [Sallans & Hinton 2001]
(C) Copyright IBM Corp. 2018 21
state 𝑠
hidden units
𝐱
𝐹 𝐱 = − log ෍
ሚ𝐡
exp −𝐸 𝐱, ሚ𝐡
𝐸 𝐱, 𝐡 ≡ −𝐛⊤
𝐱 − 𝐜⊤
𝐡 − 𝐱⊤
𝐖𝐡
𝐹 𝐱 = −𝐛⊤ 𝐱
No hidden units
𝐡 Bias 𝐛 represents individual
value of each action
強化学習による「多様性」の学習
(C) Copyright IBM Corp. 2018 22
Reinforcement learning with SARSA
• Tabular case
𝑄 𝑠𝑡, 𝑎 𝑡 ← 𝑄 𝑠𝑡, 𝑎 𝑡 + 𝜂 TD 𝑡
where TD 𝑡 ≡ 𝑟𝑡+1 + 𝜌 𝑄 𝑠𝑡+1, 𝑎 𝑡+1 − 𝑄 𝑠𝑡, 𝑎 𝑡
• With functional approximation: 𝑄 𝜃 𝑠, 𝑎 ≈ 𝑄 𝑠, 𝑎
𝜃 ← 𝜃 + 𝜂 TD 𝑡 𝛻𝜃 𝑄 𝜃(𝑠𝑡, 𝑎 𝑡)
(C) Copyright IBM Corp. 2018 23
Cumulative reward
from 𝑡
Cumulative reward
from 𝑡
Can learn our kernel 𝐋 𝑡 via SARSA in an end-
to-end manner
• 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
• 𝐋 𝑡 ≡ 𝐕 𝐃 𝑡 𝐕⊤
• 𝐃 𝑡 ≡ Diagonal exp 𝐝 𝑡 𝜙
(C) Copyright IBM Corp. 2018 24
∇ 𝛼 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 1
∇ 𝐕(ത𝐱 𝑡) 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝟎
∇ 𝐕(𝐱 𝑡) 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 2 𝐕 𝐱 𝑡
+ ⊤
∇ 𝜙 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = diag 𝐕 𝐱 𝑡
+
𝐕(𝐱 𝑡) ∇ 𝜙 𝐝 𝑡 𝜙
𝜃 ← 𝜃 + 𝜂 TD 𝑡 𝛻𝜃 𝑄 𝜃(𝑠𝑡, 𝑎 𝑡)
Example: Blocker Task
• Reward
• +1 when an attacker reaches the end zone
• −1 each step
• We use target positions of agents as the feature of state-action pair
• 𝐱 𝑡 ≡ 𝜓 𝑎 𝑡 = 𝑠𝑡+1 ∈ 0, 1 21
(C) Copyright IBM Corp. 2018 25
Determinantal SARSA finds a nearly optimal
policy 10 times faster than baseline methods
(C) Copyright IBM Corp. 2018 26
[Heese+ 2013]
SARSA: Free-energy SARSA
NN: Neural network
EQNAC: Energy-based Q-value natural actor critic
ENATDAC: Energy-based natural TD actor critic
Determinantal SARSA
Quality-similarity decomposition of the kernel
• Value, relevance, quality
𝑞𝑖 = 𝐿𝑖,𝑖
• Similarity
𝑆𝑖,𝑗 =
𝐿𝑖,𝑗
𝐿𝑖,𝑖 𝐿𝑗,𝑗
(C) Copyright IBM Corp. 2018 27
𝐋
𝐪
𝐪𝐒=
Value of individual actions (next positions)
learned by Determinantal SARSA
(C) Copyright IBM Corp. 2018 28
(C) Copyright IBM Corp. 2018 29
similar
dissimilar
low value when
taken together
high value when
taken together
Recall:
Our definition of action similarity
Similarity between actions (next positions)
learned by Determinantal SARSA
Stochastic Policy Task
• 2 𝑁 actions
• If action matches states
• Get +10 reward
• Hidden state transitions
(C) Copyright IBM Corp. 2018 30
00101
01011
Determinantal SARSA finds nearly optimal policies,
while baselines suffer from partial observability
(C) Copyright IBM Corp. 2018 31
(𝑁 = 5) (𝑁 = 6) (𝑁 = 8)
SARSA: Free-energy SARSA
NN: Neural network
EQNAC: Energy-based Q-value natural actor critic
ENATDAC: Energy-based natural TD actor critic
Determinantal SARSA Determinantal SARSA Determinantal SARSA
Baselines from [Heese+ 2013]
「多様」な行動の選択
(C) Copyright IBM Corp. 2018 32
Want to sample actions having high value
with high probability
• 𝜀-greedy
• Uniformly at random with probability 𝜀
• Best action (𝑎⋆ = argmax 𝑎 𝑄(𝑠, 𝑎)) with probability 1 − 𝜀
Intractable for large action space
• Boltzmann exploration
• Take action 𝑎 with probability ∼ exp(−𝛽 𝑄(𝑠, 𝑎))
• 𝛽: inverse temperature
(C) Copyright IBM Corp. 2018 33
Choosing a diverse team-action with
Boltzmann exploration
• Our value function:
𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
• Team-action selected according to Boltzmann exploration
𝜋 𝐱 𝑡 𝐳≤𝑡 =
exp 𝛽𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡
σ෤𝐱 exp 𝛽𝑄 𝜃 𝐳≤𝑡, ෤𝐱
=
det 𝐋 𝑡 𝐱 𝑡
𝛽
σ෤𝐱 det 𝐋 𝑡 ෤𝐱 𝛽
(C) Copyright IBM Corp. 2018 34
𝐱 𝑡 ≡ 𝜓 𝑎 𝑡 ∈ 0, 1 𝑁:
Binary features of team-action 𝑎 𝑡
Boltzmann exploration can be performed
efficiently when 𝛽 = 1
𝜋 𝐱 𝑡 𝐳≤𝑡 =
det 𝐋 𝑡 𝐱 𝑡
σ෤𝐱 det 𝐋 𝑡 ෤𝐱
=
det 𝐋 𝑡 𝐱 𝑡
det 𝐋 𝑡 + 𝐈
(C) Copyright IBM Corp. 2018 35
Sum over 2 𝑁 terms Determinant of 𝑁 × 𝑁 matrix
Determinantal point process (DPP) has been
studied for modeling and learning diversity
• Probability of selecting a subset 𝒳 [Macchi 1975]:
ℙ 𝒳 =
det 𝐋 𝒳
σ ෡𝒳
det 𝐋 ෡𝒳
=
det 𝐋 𝒳
det 𝐋 + 𝐈
• Sampling in 𝑂 𝑁 𝑘3
time [Hough+ 2006]
• 𝑁: size of ground set
• 𝑘: number of items in sampled subset
(C) Copyright IBM Corp. 2018 36
[Kulesza & Taskar (2012)]
Prior work has been applying DPPs in
machine learning and artificial intelligence
• Recommendation [Gillenwater+ 2014, Gartrell+ 2017]
• Summarization of document, videos, … [Gong+ 2014]
• Hyper-parameter optimization [Kathuria+ 2016]
• Mini-batch sampling [Zhang+ 2017]
• Modeling neural spiking [Snoek+ 2013]
(C) Copyright IBM Corp. 2018 37
Nearly exact Boltzmann exploration with
MCMC [Kang 2013 (for 𝛽 = 1)]
1. Initialize 𝐱
2. Repeat
• 𝐱 ← 𝐱′ with probability min 1,
det 𝐋 𝑡 𝐱′
det 𝐋 𝑡 𝐱
𝛽
• When 𝐱’ and 𝐱 differs by one bit, det 𝐋 𝑡 𝐱′
can be computed from
det 𝐋 𝑡 𝐱 via rank-one update techniques
• Assume we can find 𝑎 ← 𝜓−1(𝐱)
(C) Copyright IBM Corp. 2018 38
Simpler approaches:
Heuristics for more exploration and exploitation
• For more exploration
• Uniform with probability 𝜀
• DPP with probability 1 − 𝜀
• For more exploitation
• Sample 𝑀 actions according to DPP
• Choose the best among 𝑀
• Hard-core point processes [Matérn 1986]
(C) Copyright IBM Corp. 2018 39
Summary
(C) Copyright IBM Corp. 2018 40
(C) Copyright IBM Corp. 2018 41
Need diversity in team-actions Our definition of action similarity
similar
dissimilar
low value when
taken together
high value when
taken together
Challenge
𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
Our solution
Exponentially many team-actions
Efficient learning
∇ 𝐕(𝐱 𝑡) 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 2 𝐕 𝐱 𝑡
+ ⊤
∇ 𝜙 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = diag 𝐕 𝐱 𝑡
+ 𝐕(𝐱 𝑡) ∇ 𝜙 𝐝 𝑡 𝜙
Efficient sampling (cf. DPP)
𝜋 𝐱 𝑡 𝐳≤𝑡 =
exp 𝛽𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡
σ෤𝐱 exp 𝛽𝑄 𝜃 𝐳≤𝑡, ෤𝐱
=
det 𝐋 𝑡 𝐱 𝑡
𝛽
σ෤𝐱 det 𝐋 𝑡 ෤𝐱 𝛽
References
• T. Osogami and R. Raymond, Determinantal reinforcement learning,
AAAI-19
• T. Osogami and R. Raymond, Determinantal SARSA: Toward deep
reinforcement learning for collaborative agents, NIPS 2017 Deep
Reinforcement Learning Symposium
• T. Osogami, R. Raymond, A. Goel, T. Shirai, and T. Maehara, Dynamic
determinantal point processes, AAAI-18
• K. Zhao, T. Morimura, and T. Osogami, Event-based visual analytics for
team-based invasion sports, PacificVis 2018
(C) Copyright IBM Corp. 2018 42

More Related Content

What's hot

A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...NECST Lab @ Politecnico di Milano
 
Applied Machine Learning For Search Engine Relevance
Applied Machine Learning For Search Engine Relevance Applied Machine Learning For Search Engine Relevance
Applied Machine Learning For Search Engine Relevance charlesmartin14
 
Stochastic gradient descent and its tuning
Stochastic gradient descent and its tuningStochastic gradient descent and its tuning
Stochastic gradient descent and its tuningArsalan Qadri
 
A Brief Survey of Reinforcement Learning
A Brief Survey of Reinforcement LearningA Brief Survey of Reinforcement Learning
A Brief Survey of Reinforcement LearningGiancarlo Frison
 
Inverse Kinematics Using Genetic Algorithms
Inverse Kinematics Using Genetic AlgorithmsInverse Kinematics Using Genetic Algorithms
Inverse Kinematics Using Genetic Algorithmssatyendrajaladi
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkDB Tsai
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with SparkBarak Gitsis
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine LearningFabian Pedregosa
 
Distributed Deep Q-Learning
Distributed Deep Q-LearningDistributed Deep Q-Learning
Distributed Deep Q-LearningLyft
 
Gbm.more GBM in H2O
Gbm.more GBM in H2OGbm.more GBM in H2O
Gbm.more GBM in H2OSri Ambati
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsYoonho Lee
 
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentationjesujoseph
 
Towards typesafe deep learning in scala
Towards typesafe deep learning in scalaTowards typesafe deep learning in scala
Towards typesafe deep learning in scalaTongfei Chen
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
ENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-MeansENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-Meanstthonet
 
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...Ruairi de Frein
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learningJeremy Nixon
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientFabian Pedregosa
 
Georgetown B-school Talk 2021
Georgetown B-school Talk  2021Georgetown B-school Talk  2021
Georgetown B-school Talk 2021Charles Martin
 

What's hot (19)

A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
 
Applied Machine Learning For Search Engine Relevance
Applied Machine Learning For Search Engine Relevance Applied Machine Learning For Search Engine Relevance
Applied Machine Learning For Search Engine Relevance
 
Stochastic gradient descent and its tuning
Stochastic gradient descent and its tuningStochastic gradient descent and its tuning
Stochastic gradient descent and its tuning
 
A Brief Survey of Reinforcement Learning
A Brief Survey of Reinforcement LearningA Brief Survey of Reinforcement Learning
A Brief Survey of Reinforcement Learning
 
Inverse Kinematics Using Genetic Algorithms
Inverse Kinematics Using Genetic AlgorithmsInverse Kinematics Using Genetic Algorithms
Inverse Kinematics Using Genetic Algorithms
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache Spark
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
 
Distributed Deep Q-Learning
Distributed Deep Q-LearningDistributed Deep Q-Learning
Distributed Deep Q-Learning
 
Gbm.more GBM in H2O
Gbm.more GBM in H2OGbm.more GBM in H2O
Gbm.more GBM in H2O
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
 
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentation
 
Towards typesafe deep learning in scala
Towards typesafe deep learning in scalaTowards typesafe deep learning in scala
Towards typesafe deep learning in scala
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
ENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-MeansENBIS 2018 presentation on Deep k-Means
ENBIS 2018 presentation on Deep k-Means
 
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learning
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
Georgetown B-school Talk 2021
Georgetown B-school Talk  2021Georgetown B-school Talk  2021
Georgetown B-school Talk 2021
 

Similar to 第19回ステアラボ人工知能セミナー発表資料

Parallel Machine Learning- DSGD and SystemML
Parallel Machine Learning- DSGD and SystemMLParallel Machine Learning- DSGD and SystemML
Parallel Machine Learning- DSGD and SystemMLJanani C
 
Many-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing ClustersMany-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing ClustersTarik Reza Toha
 
Analysis of Algorithm full version 2024.pptx
Analysis of Algorithm  full version  2024.pptxAnalysis of Algorithm  full version  2024.pptx
Analysis of Algorithm full version 2024.pptxrajesshs31r
 
Design Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxDesign Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxrajesshs31r
 
Introduction to optimization technique
Introduction to optimization techniqueIntroduction to optimization technique
Introduction to optimization techniqueKAMINISINGH963
 
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC BerkeleyWhy Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC BerkeleyCharles Martin
 
Stanford ICME Lecture on Why Deep Learning Works
Stanford ICME Lecture on Why Deep Learning WorksStanford ICME Lecture on Why Deep Learning Works
Stanford ICME Lecture on Why Deep Learning WorksCharles Martin
 
Paper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelinePaper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelineChenYiHuang5
 
VCE Unit 01 (1).pptx
VCE Unit 01 (1).pptxVCE Unit 01 (1).pptx
VCE Unit 01 (1).pptxskilljiolms
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfDuy-Hieu Bui
 
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm AnalysisMary Margarat
 
Complexity Analysis
Complexity Analysis Complexity Analysis
Complexity Analysis Shaista Qadir
 
WeightWatcher LLM Update
WeightWatcher LLM UpdateWeightWatcher LLM Update
WeightWatcher LLM UpdateCharles Martin
 
An Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAn Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAnirban Santara
 
adversarial robustness lecture
adversarial robustness lectureadversarial robustness lecture
adversarial robustness lectureMuhammadAhmedShah2
 
Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Charles Martin
 
# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustmentTerence Huang
 

Similar to 第19回ステアラボ人工知能セミナー発表資料 (20)

Parallel Machine Learning- DSGD and SystemML
Parallel Machine Learning- DSGD and SystemMLParallel Machine Learning- DSGD and SystemML
Parallel Machine Learning- DSGD and SystemML
 
Many-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing ClustersMany-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing Clusters
 
Analysis of Algorithm full version 2024.pptx
Analysis of Algorithm  full version  2024.pptxAnalysis of Algorithm  full version  2024.pptx
Analysis of Algorithm full version 2024.pptx
 
Design Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxDesign Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptx
 
Introduction to optimization technique
Introduction to optimization techniqueIntroduction to optimization technique
Introduction to optimization technique
 
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC BerkeleyWhy Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
 
Stanford ICME Lecture on Why Deep Learning Works
Stanford ICME Lecture on Why Deep Learning WorksStanford ICME Lecture on Why Deep Learning Works
Stanford ICME Lecture on Why Deep Learning Works
 
Paper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelinePaper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipeline
 
VCE Unit 01 (1).pptx
VCE Unit 01 (1).pptxVCE Unit 01 (1).pptx
VCE Unit 01 (1).pptx
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
 
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm Analysis
 
Complexity Analysis
Complexity Analysis Complexity Analysis
Complexity Analysis
 
WeightWatcher LLM Update
WeightWatcher LLM UpdateWeightWatcher LLM Update
WeightWatcher LLM Update
 
Xgboost
XgboostXgboost
Xgboost
 
Deep ar presentation
Deep ar presentationDeep ar presentation
Deep ar presentation
 
DiscoGAN
DiscoGANDiscoGAN
DiscoGAN
 
An Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAn Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGI
 
adversarial robustness lecture
adversarial robustness lectureadversarial robustness lecture
adversarial robustness lecture
 
Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022
 
# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment# Can we trust ai. the dilemma of model adjustment
# Can we trust ai. the dilemma of model adjustment
 

Recently uploaded

Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyDrAnita Sharma
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 

Recently uploaded (20)

Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 

第19回ステアラボ人工知能セミナー発表資料

  • 1. 多様性を考慮した強化学習・機械学習 恐神 貴行 IBM Research – Tokyo (C) Copyright IBM Corp. 2018 1 This work was supported by JST CREST Grant Number JPMJCR1304, Japan.
  • 2. 恐神貴行 • IBM東京基礎研究所 シニア・リサーチ・スタッフ・メンバー • 1998年 東京大学工学部電子工学科卒 • 2005年 カーネギーメロン大学計算機科学科博士号取得 • 研究分野: 確率モデル、逐次的意思決定 (C) Copyright IBM Corp. 2018 2
  • 4. Want to see relevant and diverse items in recommendation (C) Copyright IBM Corp. 2018 4 “trunk” by Bing (little diversity)“trunk” by Google (some diversity) Can have more diversity
  • 5. Want to take relevant and diverse actions in team sports Zone defense Man-to-man defense (C) Copyright IBM Corp. 2018 5 basketballhq.com upload.wikimedia.org
  • 6. Diversity also matters in soccer (C) Copyright IBM Corp. 2018 6 RoboCupSoccer - Simulation J League (Based on data from DataStadium)
  • 7. Diversity also matters in Multi-agent games (C) Copyright IBM Corp. 2018 7
  • 9. Reinforcement learning seeks to find an optimal policy for sequential decision making (C) Copyright IBM Corp. 2018 9 Agent Action Environment Observation, Reward Goal: Maximize cumulative reward
  • 10. An approach of reinforcement learning is to learn the action-value function • Action-value function: 𝑄 𝜋 • 𝑄 𝜋(𝑠, 𝑎): Expected cumulative reward from state 𝑠 by taking action 𝑎 and then following (optimal) policy 𝜋 • Assume (relaxed later): Markovian state 𝑠 is fully observable (C) Copyright IBM Corp. 2018 10 𝑠 𝑎 … reward …reward reward 𝑄 𝜋(𝑠, 𝑎)
  • 11. For most practical tasks, the action-value function needs to be approximated Exponentially large state space • Combination of multiple factors • e.g. Factor: “state” of each position • History of observations Exponentially large action space • Combination of multiple “levers” • Combination of multiple agents (C) Copyright IBM Corp. 2018 11 Cannot deal with 𝑄(𝑠, 𝑎) in tabular form
  • 12. Action-value functions approximated with (deep) neural networks (C) Copyright IBM Corp. 2018 12 state 𝑠 Distributed representation (e.g. each unit for each pixel) 𝑄 𝜋 (𝑠, 𝑎1) 𝑄 𝜋(𝑠, 𝑎2) Exponentially many units for exponentially large action space
  • 13. Challenges in collaborative multi-agent reinforcement learning • Exponentially many combinations of actions (team-actions) (C) Copyright IBM Corp. 2018 13 1. How to efficiently evaluate the value of team-actions 2. How to efficiently sample good team-actions
  • 14. Consider the diversity of actions, in addition to the value (relevance) of each action (C) Copyright IBM Corp. 2018 14 Diversity Value of team-action of 3 agents: Volume2 = 𝑄(𝑠, 𝑎1, 𝑎3, 𝑎5 )
  • 15. Diversity can be represented by determinant (C) Copyright IBM Corp. 2018 15 Diversity Volume2 = 𝑄 𝑠, 𝑎1, 𝑎3, 𝑎5 = det 𝐕(𝐱) 𝐕(𝐱)⊤ 𝐕 = 𝑎1 𝑎2 𝑎3 𝑎4 𝑎5 : = 𝑣11 𝑣12 𝑣13 𝑣21 𝑣22 𝑣23 𝑣31 𝑣32 𝑣33 𝑣41 𝑣42 𝑣43 𝑣51 𝑣52 𝑣53 : 𝐋 = 𝐕 𝐕⊤ 𝐱 = 1, 0, 1, 0, 1, … 𝐋 𝐱 = 𝐕 𝐱 𝐕 𝐱 ⊤ Indicate selected actions
  • 16. Our definition of diversity (similarity) in multi-agent reinforcement learning • The value of team-action is represented by determinant (volume) (C) Copyright IBM Corp. 2018 16 We will learn the similarity between actions accordingly Two actions are similar Two actions are dissimilar Value is low when the two actions are taken together Value is high when the two actions are taken together
  • 17. We represent the action-value function with determinant • 𝐳≤𝑡 ≡ (𝐳0, 𝐳1, … , 𝐳 𝑡): Time−series of observations • 𝐳≤𝑡 = 𝑠𝑡 if Markovian state 𝑠𝑡 is observable • 𝐱 𝑡 ≡ 𝜓 𝑎 𝑡 ∈ 0, 1 𝑁: Binary features of team-action 𝑎 𝑡 • e.g. 𝐱 𝑡 indicates which actions are selected by the team • 𝐋 𝑡: Positive semi-definite matrix (kernel) that can depend on 𝐳≤𝑡 (C) Copyright IBM Corp. 2018 17 𝐋 𝑡(𝐱t)𝑥𝑡 𝑖 = 1 𝐋 𝑡 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
  • 18. Particular structure of the kernel for effective learning • 𝐋 𝑡 ≡ 𝐕 𝐃 𝑡 𝐕⊤ • 𝐕: 𝑁 × 𝐾 matrix (𝐾 ≤ 𝑁) • 𝐃 𝑡 ≡ Diagonal exp 𝐝 𝑡 𝜙 • 𝐝 𝑡 𝜙 : differentiable time−series model with parameter 𝜙 (e.g. RNN, LSTM, DyBM, VAR) (C) Copyright IBM Corp. 2018 18 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡
  • 19. Special case of diagonal kernel reduces to the standard approach of ignoring diversity (C) Copyright IBM Corp. 2018 19 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡 = 𝛼 + 𝐝 𝑡 𝜙 ⊤ 𝐱 𝑡 = ෍ 𝑖: 𝑥 𝑡 𝑖=1 𝑑 𝑡 𝜙 𝑖 • 𝐋 𝑡 = 𝐃 𝑡 (let 𝐕 = I) • 𝐃 𝑡 ≡ Diag exp 𝐝 𝑡 𝜙 • 𝐝 𝑡 𝜙 : differentiable time−series model Sum of the values of selected actions 𝑑 𝑡 𝜙 𝑖: value of 𝑎𝑖 at time 𝑡 𝑥𝑡 𝑖 = 1 𝐋 𝑡
  • 20. Determinantal layer for diversity (C) Copyright IBM Corp. 2018 20 state 𝑠 𝐝 𝑡 𝜙 𝑄 𝜃 𝑠, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡 𝐋 𝑡 ≡ 𝐕 Diag exp 𝐝 𝑡 𝜙 𝐕⊤
  • 21. Prior work uses free-energy of restricted Boltzmann machines [Sallans & Hinton 2001] (C) Copyright IBM Corp. 2018 21 state 𝑠 hidden units 𝐱 𝐹 𝐱 = − log ෍ ሚ𝐡 exp −𝐸 𝐱, ሚ𝐡 𝐸 𝐱, 𝐡 ≡ −𝐛⊤ 𝐱 − 𝐜⊤ 𝐡 − 𝐱⊤ 𝐖𝐡 𝐹 𝐱 = −𝐛⊤ 𝐱 No hidden units 𝐡 Bias 𝐛 represents individual value of each action
  • 23. Reinforcement learning with SARSA • Tabular case 𝑄 𝑠𝑡, 𝑎 𝑡 ← 𝑄 𝑠𝑡, 𝑎 𝑡 + 𝜂 TD 𝑡 where TD 𝑡 ≡ 𝑟𝑡+1 + 𝜌 𝑄 𝑠𝑡+1, 𝑎 𝑡+1 − 𝑄 𝑠𝑡, 𝑎 𝑡 • With functional approximation: 𝑄 𝜃 𝑠, 𝑎 ≈ 𝑄 𝑠, 𝑎 𝜃 ← 𝜃 + 𝜂 TD 𝑡 𝛻𝜃 𝑄 𝜃(𝑠𝑡, 𝑎 𝑡) (C) Copyright IBM Corp. 2018 23 Cumulative reward from 𝑡 Cumulative reward from 𝑡
  • 24. Can learn our kernel 𝐋 𝑡 via SARSA in an end- to-end manner • 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡 • 𝐋 𝑡 ≡ 𝐕 𝐃 𝑡 𝐕⊤ • 𝐃 𝑡 ≡ Diagonal exp 𝐝 𝑡 𝜙 (C) Copyright IBM Corp. 2018 24 ∇ 𝛼 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 1 ∇ 𝐕(ത𝐱 𝑡) 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝟎 ∇ 𝐕(𝐱 𝑡) 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 2 𝐕 𝐱 𝑡 + ⊤ ∇ 𝜙 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = diag 𝐕 𝐱 𝑡 + 𝐕(𝐱 𝑡) ∇ 𝜙 𝐝 𝑡 𝜙 𝜃 ← 𝜃 + 𝜂 TD 𝑡 𝛻𝜃 𝑄 𝜃(𝑠𝑡, 𝑎 𝑡)
  • 25. Example: Blocker Task • Reward • +1 when an attacker reaches the end zone • −1 each step • We use target positions of agents as the feature of state-action pair • 𝐱 𝑡 ≡ 𝜓 𝑎 𝑡 = 𝑠𝑡+1 ∈ 0, 1 21 (C) Copyright IBM Corp. 2018 25
  • 26. Determinantal SARSA finds a nearly optimal policy 10 times faster than baseline methods (C) Copyright IBM Corp. 2018 26 [Heese+ 2013] SARSA: Free-energy SARSA NN: Neural network EQNAC: Energy-based Q-value natural actor critic ENATDAC: Energy-based natural TD actor critic Determinantal SARSA
  • 27. Quality-similarity decomposition of the kernel • Value, relevance, quality 𝑞𝑖 = 𝐿𝑖,𝑖 • Similarity 𝑆𝑖,𝑗 = 𝐿𝑖,𝑗 𝐿𝑖,𝑖 𝐿𝑗,𝑗 (C) Copyright IBM Corp. 2018 27 𝐋 𝐪 𝐪𝐒=
  • 28. Value of individual actions (next positions) learned by Determinantal SARSA (C) Copyright IBM Corp. 2018 28
  • 29. (C) Copyright IBM Corp. 2018 29 similar dissimilar low value when taken together high value when taken together Recall: Our definition of action similarity Similarity between actions (next positions) learned by Determinantal SARSA
  • 30. Stochastic Policy Task • 2 𝑁 actions • If action matches states • Get +10 reward • Hidden state transitions (C) Copyright IBM Corp. 2018 30 00101 01011
  • 31. Determinantal SARSA finds nearly optimal policies, while baselines suffer from partial observability (C) Copyright IBM Corp. 2018 31 (𝑁 = 5) (𝑁 = 6) (𝑁 = 8) SARSA: Free-energy SARSA NN: Neural network EQNAC: Energy-based Q-value natural actor critic ENATDAC: Energy-based natural TD actor critic Determinantal SARSA Determinantal SARSA Determinantal SARSA Baselines from [Heese+ 2013]
  • 33. Want to sample actions having high value with high probability • 𝜀-greedy • Uniformly at random with probability 𝜀 • Best action (𝑎⋆ = argmax 𝑎 𝑄(𝑠, 𝑎)) with probability 1 − 𝜀 Intractable for large action space • Boltzmann exploration • Take action 𝑎 with probability ∼ exp(−𝛽 𝑄(𝑠, 𝑎)) • 𝛽: inverse temperature (C) Copyright IBM Corp. 2018 33
  • 34. Choosing a diverse team-action with Boltzmann exploration • Our value function: 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡 • Team-action selected according to Boltzmann exploration 𝜋 𝐱 𝑡 𝐳≤𝑡 = exp 𝛽𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 σ෤𝐱 exp 𝛽𝑄 𝜃 𝐳≤𝑡, ෤𝐱 = det 𝐋 𝑡 𝐱 𝑡 𝛽 σ෤𝐱 det 𝐋 𝑡 ෤𝐱 𝛽 (C) Copyright IBM Corp. 2018 34 𝐱 𝑡 ≡ 𝜓 𝑎 𝑡 ∈ 0, 1 𝑁: Binary features of team-action 𝑎 𝑡
  • 35. Boltzmann exploration can be performed efficiently when 𝛽 = 1 𝜋 𝐱 𝑡 𝐳≤𝑡 = det 𝐋 𝑡 𝐱 𝑡 σ෤𝐱 det 𝐋 𝑡 ෤𝐱 = det 𝐋 𝑡 𝐱 𝑡 det 𝐋 𝑡 + 𝐈 (C) Copyright IBM Corp. 2018 35 Sum over 2 𝑁 terms Determinant of 𝑁 × 𝑁 matrix
  • 36. Determinantal point process (DPP) has been studied for modeling and learning diversity • Probability of selecting a subset 𝒳 [Macchi 1975]: ℙ 𝒳 = det 𝐋 𝒳 σ ෡𝒳 det 𝐋 ෡𝒳 = det 𝐋 𝒳 det 𝐋 + 𝐈 • Sampling in 𝑂 𝑁 𝑘3 time [Hough+ 2006] • 𝑁: size of ground set • 𝑘: number of items in sampled subset (C) Copyright IBM Corp. 2018 36 [Kulesza & Taskar (2012)]
  • 37. Prior work has been applying DPPs in machine learning and artificial intelligence • Recommendation [Gillenwater+ 2014, Gartrell+ 2017] • Summarization of document, videos, … [Gong+ 2014] • Hyper-parameter optimization [Kathuria+ 2016] • Mini-batch sampling [Zhang+ 2017] • Modeling neural spiking [Snoek+ 2013] (C) Copyright IBM Corp. 2018 37
  • 38. Nearly exact Boltzmann exploration with MCMC [Kang 2013 (for 𝛽 = 1)] 1. Initialize 𝐱 2. Repeat • 𝐱 ← 𝐱′ with probability min 1, det 𝐋 𝑡 𝐱′ det 𝐋 𝑡 𝐱 𝛽 • When 𝐱’ and 𝐱 differs by one bit, det 𝐋 𝑡 𝐱′ can be computed from det 𝐋 𝑡 𝐱 via rank-one update techniques • Assume we can find 𝑎 ← 𝜓−1(𝐱) (C) Copyright IBM Corp. 2018 38
  • 39. Simpler approaches: Heuristics for more exploration and exploitation • For more exploration • Uniform with probability 𝜀 • DPP with probability 1 − 𝜀 • For more exploitation • Sample 𝑀 actions according to DPP • Choose the best among 𝑀 • Hard-core point processes [Matérn 1986] (C) Copyright IBM Corp. 2018 39
  • 40. Summary (C) Copyright IBM Corp. 2018 40
  • 41. (C) Copyright IBM Corp. 2018 41 Need diversity in team-actions Our definition of action similarity similar dissimilar low value when taken together high value when taken together Challenge 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 𝛼 + log det 𝐋 𝑡 𝐱 𝑡 Our solution Exponentially many team-actions Efficient learning ∇ 𝐕(𝐱 𝑡) 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = 2 𝐕 𝐱 𝑡 + ⊤ ∇ 𝜙 𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 = diag 𝐕 𝐱 𝑡 + 𝐕(𝐱 𝑡) ∇ 𝜙 𝐝 𝑡 𝜙 Efficient sampling (cf. DPP) 𝜋 𝐱 𝑡 𝐳≤𝑡 = exp 𝛽𝑄 𝜃 𝐳≤𝑡, 𝐱 𝑡 σ෤𝐱 exp 𝛽𝑄 𝜃 𝐳≤𝑡, ෤𝐱 = det 𝐋 𝑡 𝐱 𝑡 𝛽 σ෤𝐱 det 𝐋 𝑡 ෤𝐱 𝛽
  • 42. References • T. Osogami and R. Raymond, Determinantal reinforcement learning, AAAI-19 • T. Osogami and R. Raymond, Determinantal SARSA: Toward deep reinforcement learning for collaborative agents, NIPS 2017 Deep Reinforcement Learning Symposium • T. Osogami, R. Raymond, A. Goel, T. Shirai, and T. Maehara, Dynamic determinantal point processes, AAAI-18 • K. Zhao, T. Morimura, and T. Osogami, Event-based visual analytics for team-based invasion sports, PacificVis 2018 (C) Copyright IBM Corp. 2018 42