Safe and Efficient Off-Policy Reinforcement Learning
Presenter: Ryo Iwaki
2016/11/12
@Kansai NIPS Reading Group

Self-Introduction

• Ryo Iwaki
  – Osaka University, Asada Laboratory, first-year doctoral student (D1)
  – ryo.iwaki@ams.eng.osaka-u.ac.jp
  – reinforcement learning :: policy search :: natural policy gradient methods

Safe and Efficient Off-Policy Reinforcement Learning

(First page of the paper)
Safe and efficient off-policy reinforcement learning
Rémi Munos (munos@google.com), Google DeepMind
Tom Stepleton (stepleton@google.com), Google DeepMind
Anna Harutyunyan (anna.harutyunyan@vub.ac.be), Vrije Universiteit Brussel
Marc G. Bellemare (bellemare@google.com), Google DeepMind

Abstract: In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) low variance; (2) safety, as it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) efficiency, as it makes the best use of samples collected from near on-policy behaviour policies. We analyse the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. To our knowledge, this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, [...]

• Off-policy reinforcement learning
  – the policy that generates the training data differs from the policy we want to learn
• The paper proposes an algorithm that learns the value function by approximating the return sample-efficiently for an arbitrary behaviour policy

Reinforcement Learning

• State: $x \in \mathcal{X}$
• Action: $a \in \mathcal{A}$
• State transition probability: $P : \mathcal{X} \times \mathcal{A} \to \mathrm{Pr}(\mathcal{X})$
• Reward function: $r : \mathcal{X} \times \mathcal{A} \to [-R_{\mathrm{MAX}}, R_{\mathrm{MAX}}]$
• Policy: $\pi : \mathcal{X} \to \mathrm{Pr}(\mathcal{A})$
• Q function: $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$
• Transition operator:
  $(P^\pi Q)(x, a) \triangleq \sum_{x' \in \mathcal{X}} \sum_{a' \in \mathcal{A}} P(x'|x, a)\, \pi(a'|x')\, Q(x', a')$
• Action-value function:
  $Q^\pi \triangleq \sum_{t \ge 0} \gamma^t (P^\pi)^t r, \qquad \gamma \in [0, 1)$
• Learning objective: obtain a policy that maximizes the value

(Figure: agent-environment loop; the agent sends an action to the environment and receives a state and a reward.)

The Bellman Equations

• Bellman operator: $T^\pi Q \triangleq r + \gamma P^\pi Q$
  – the action value $Q^\pi$ is its fixed point
• Bellman optimality operator: $T Q \triangleq r + \gamma \max_\pi P^\pi Q$
  – the optimal value function $Q^*$ is its fixed point
  – the policy that is greedy with respect to $Q^*$ is an optimal policy $\pi^*$
(A NumPy sketch of these operators is given below.)

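To make the operator notation of the two slides above concrete, here is a minimal NumPy sketch (not from the slides; the small random MDP and all names are illustrative) of the transition operator $P^\pi$, the discounted value $Q^\pi$, and the two Bellman operators:

```python
import numpy as np

# Illustrative finite MDP (made-up arrays, only to exercise the definitions).
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, x']
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r(x, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi(a|x)

def P_pi(Q):
    """(P^pi Q)(x, a) = sum_{x', a'} P(x'|x, a) pi(a'|x') Q(x', a')."""
    v = (pi * Q).sum(axis=1)                # v(x') = sum_a' pi(a'|x') Q(x', a')
    return np.einsum('xay,y->xa', P, v)

def Q_pi(n_terms=2000):
    """Q^pi = sum_{t>=0} gamma^t (P^pi)^t r, truncated after n_terms terms."""
    Q, term = np.zeros_like(r), r.copy()
    for _ in range(n_terms):
        Q += term
        term = gamma * P_pi(term)
    return Q

def T_pi(Q):
    """Bellman operator T^pi Q = r + gamma P^pi Q; Q^pi is its fixed point."""
    return r + gamma * P_pi(Q)

def T_opt(Q):
    """Bellman optimality operator T Q = r + gamma max_pi P^pi Q
    (the maximizing policy is greedy in the next state)."""
    v_max = Q.max(axis=1)
    return r + gamma * np.einsum('xay,y->xa', P, v_max)

qpi = Q_pi()
print(np.max(np.abs(T_pi(qpi) - qpi)))   # ~0: Q^pi is the fixed point of T^pi
Q_star = np.zeros_like(r)
for _ in range(2000):                    # value iteration: Q -> Q*
    Q_star = T_opt(Q_star)
print(Q_star.argmax(axis=1))             # greedy policy w.r.t. Q*
```

Both operators are γ-contractions, which is the property the paper later extends to the off-policy return operator.
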
Off-Policy

• The estimation (target) policy $\pi$ that we want to learn differs from the behaviour policy $\mu$ that generated the data: $\mathbb{E}_\mu[\cdot] \neq \mathbb{E}_\pi[\cdot]$

(Examples embedded on the slide: the convolutional network of DQN for Atari [Mnih+ 15], policy search on the humanoid robot CB-i [Sugimoto+ 16], and Off-PAC on a continuous 2D grid world, where Off-PAC is the only algorithm that reliably learns to reach the goal [Degris+ 12].)

Return-Based Reinforcement Learning

• Forward view
  – we would like to learn with the true return $\sum_{t \ge 0} \gamma^t (P^\pi)^t r$ as the target,
    $Q(x_t, a_t) \leftarrow (1 - \alpha) Q(x_t, a_t) + \alpha \sum_{s \ge t} \gamma^{s-t} (P^\pi)^{s-t} r$
  – but this cannot be done online (the future rewards are not available yet)
• Backward view (eligibility trace)
  – keep the past state-action pairs in a trace, $\sum_{s=0}^{t} (\gamma\lambda)^{t-s} \mathbb{1}\{x_s, a_s = x, a\}$, and distribute the current reward to them
• backward view = forward view [Sutton & Barto 98]

(Figure: a trajectory $(x_{t-n}, a_{t-n}), \dots, (x_{t-1}, a_{t-1}), (x_t, a_t, r_t)$ and the future rewards $r_{t+1}, r_{t+2}, \dots, r_{t+n}$.)

λ-Return and TD(λ)

• λ-return: an exponentially weighted average of the n-step returns, $\lambda \in [0, 1]$:
  $T_\lambda^\pi Q \triangleq (1 - \lambda) \sum_{n \ge 0} \lambda^n (T^\pi)^{n+1} Q = Q + (I - \lambda \gamma P^\pi)^{-1} (T^\pi Q - Q)$
• λ bridges TD(0) and the Monte Carlo method:
  $T_{\lambda=0}^\pi Q = T^\pi Q, \qquad T_{\lambda=1}^\pi Q = Q^\pi$

(Figure: the same trajectory diagram as on the previous slide. A numerical check of the identity above follows below.)

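The identity on this slide can be checked numerically. Below is a small sketch (made-up MDP, matrix form of $P^\pi$ over state-action pairs, all names illustrative) comparing the truncated series $(1-\lambda)\sum_n \lambda^n (T^\pi)^{n+1} Q$ with the closed form $Q + (I - \lambda\gamma P^\pi)^{-1}(T^\pi Q - Q)$:

```python
import numpy as np

nX, nA, gamma, lam = 3, 2, 0.9, 0.7
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nX), size=(nX, nA))        # P[x, a, x']
r = rng.uniform(-1, 1, size=nX * nA)                 # r as a vector over (x, a)
pi = rng.dirichlet(np.ones(nA), size=nX)             # pi(a'|x')

# P^pi as an (nX*nA) x (nX*nA) matrix: P^pi[(x,a),(x',a')] = P(x'|x,a) pi(a'|x')
P_pi = np.einsum('xay,yb->xayb', P, pi).reshape(nX * nA, nX * nA)
T_pi = lambda Q: r + gamma * P_pi @ Q                # Bellman operator

Q = rng.normal(size=nX * nA)

# truncated series (1 - lam) * sum_{n>=0} lam^n (T^pi)^{n+1} Q
series, TQ = np.zeros_like(Q), Q.copy()
for n in range(2000):
    TQ = T_pi(TQ)                                    # (T^pi)^{n+1} Q
    series += (1 - lam) * lam**n * TQ

# closed form Q + (I - lam*gamma*P^pi)^{-1} (T^pi Q - Q)
closed = Q + np.linalg.solve(np.eye(nX * nA) - lam * gamma * P_pi, T_pi(Q) - Q)
print(np.max(np.abs(series - closed)))               # ~0: the two forms agree
```
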
Off-Policy, Return-Based

• The probability of generating a trajectory $\mathcal{F}_t = (x_0, a_0, r_0, x_1, \dots, x_t, a_t, r_t)$ differs greatly from policy to policy
  – the traces we are keeping become bogus
• How should the learning samples be weighted / corrected so that the return is approximated efficiently?
• Goals
  – safe: converges for arbitrary $\pi(a|x)$ and $\mu(a|x)$
  – efficient: approximates the return sample-efficiently

Off-Policy Return Operator (Forward View)

$\mathcal{R} Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \, \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$,
where $\mathbb{E}_\pi Q(x, \cdot) \triangleq \sum_a \pi(a|x) Q(x, a)$.

• the product of the trace coefficients $c_s$ reweights the trajectory-generation probability
• the expectation under the estimation policy corrects the reward term
• Using the operator $\mathcal{R}$, the paper
  – organizes the related work in a common form
  – derives a new algorithm
(A single-trajectory sketch of this estimate follows below.)

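The term inside the expectation $\mathbb{E}_\mu[\cdot]$ can be estimated from a single trajectory sampled with $\mu$. A minimal tabular sketch (illustrative; the trajectory format, the zero bootstrap at the terminal state, and the pluggable coefficient function `coeff` are assumptions of this sketch, not the authors' code):

```python
import numpy as np

def corrected_return(Q, traj, pi, gamma, coeff):
    """Single-trajectory estimate of R Q(x0, a0).

    Q     : array Q[x, a] (tabular)
    traj  : list of (x_t, a_t, r_t, mu_t) with mu_t = mu(a_t | x_t)
    pi    : array pi[x, a], the estimation (target) policy
    coeff : function (pi_t, mu_t) -> c_t, the trace coefficient
    Assumes the trajectory ends at a terminal state (bootstrap value 0).
    """
    x0, a0 = traj[0][0], traj[0][1]
    estimate, c_prod = Q[x0, a0], 1.0
    for t, (x, a, r, mu_t) in enumerate(traj):
        if t > 0:                            # the product runs over s = 1 .. t
            c_prod *= coeff(pi[x, a], mu_t)
        if t + 1 < len(traj):
            x_next = traj[t + 1][0]
            exp_q_next = np.dot(pi[x_next], Q[x_next])   # E_pi Q(x_{t+1}, .)
        else:
            exp_q_next = 0.0
        td = r + gamma * exp_q_next - Q[x, a]
        estimate += (gamma ** t) * c_prod * td
    return estimate

# e.g. Retrace with lambda = 0.9:  coeff = lambda p, m: 0.9 * min(1.0, p / m)
```
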
Related Work 1: Importance Sampling (IS) [Precup+ 00; 01; Geist & Scherrer 14]

• Plain importance weights: $c_s = \dfrac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$
• The rewards are not corrected; in expectation the operator reduces to the plain IS estimate of the return,
  $\mathcal{R}_{\mathrm{IS}} Q(x, a) = \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)} \Big) r_t \Big]$
✘ the variance of the product $\prod_{s=1}^{t} \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$ is large

(Figure: an example pair of distributions $\pi$ and $\mu$ over the actions $a$.)

Related Work 2: Q(λ) [Harutyunyan+ 16]

• $c_s = \lambda$
  – the classical eligibility trace
  – the immediate reward is corrected with the estimation policy:
    $\mathcal{R}^\lambda Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} (\gamma\lambda)^t \big( r_t + \gamma \, \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$
✘ because the trajectories are not reweighted, convergence is not guaranteed when $\pi(a|x)$ and $\mu(a|x)$ differ strongly; $\pi$ and $\mu$ must be close, roughly $\max_x \| \pi(\cdot|x) - \mu(\cdot|x) \|_1 \le \frac{1 - \gamma}{\lambda\gamma}$ → unsafe

Related Work 3: Tree Backup (TB) [Precup+ 00]

• $c_s = \lambda \, \pi(a_s|x_s)$: the trajectory is weighted by the estimation policy itself:
  $\mathcal{R}^\lambda Q(x, a) \triangleq Q(x, a) + \mathbb{E}_\mu \Big[ \sum_{t \ge 0} (\gamma\lambda)^t \Big( \prod_{s=1}^{t} \pi(a_s|x_s) \Big) \big( r_t + \gamma \, \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \Big]$
  ✔ learning is possible even when the behaviour policy is non-Markov
  ✘ when $\pi(a|x)$ and $\mu(a|x)$ are close (near on-policy), the traces decay quickly and the return is not approximated efficiently → inefficient

Proposed Method :: Retrace(λ)

• Importance weights truncated at 1: $c_s = \lambda \min\Big( 1, \dfrac{\pi(a_s|x_s)}{\mu(a_s|x_s)} \Big)$
  ✔ the variance $\mathbb{V}\Big( \sum_t \gamma^t \prod_{s=1}^{t} c_s \Big)$ takes a finite value
  ✔ even near on-policy, the trace does not decay quickly

(Figure: $\min(1, \pi(a_s|x_s)/\mu(a_s|x_s))$ compared with $\pi(a_s|x_s)$ over the actions $a$ for an example pair $\pi$, $\mu$. The trace coefficients of all four algorithms are collected in the sketch below.)

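A small sketch (illustrative helper names, not from the paper) of the four trace-coefficient choices, in a form that plugs into the `corrected_return` estimator sketched after the operator slide:

```python
def c_is(pi_t, mu_t, lam=1.0):        # importance sampling
    return pi_t / mu_t

def c_q_lambda(pi_t, mu_t, lam=1.0):  # Q(lambda): ignores both policies
    return lam

def c_tb(pi_t, mu_t, lam=1.0):        # Tree Backup, TB(lambda)
    return lam * pi_t

def c_retrace(pi_t, mu_t, lam=1.0):   # Retrace(lambda): IS weight truncated at 1
    return lam * min(1.0, pi_t / mu_t)

# usage, e.g.: corrected_return(Q, traj, pi, 0.99, lambda p, m: c_retrace(p, m, 0.9))
```
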
Comparison with Related Work (Table 1 of the paper)

  Algorithm             Definition of c_s                    Estimation variance   Guaranteed convergence†    Uses full returns (near on-policy)
  Importance sampling   π(a_s|x_s) / µ(a_s|x_s)              High                  for any π, µ               yes
  Q(λ)                  λ                                    Low                   only for π close to µ      yes
  TB(λ)                 λ π(a_s|x_s)                         Low                   for any π, µ               no
  Retrace(λ)            λ min(1, π(a_s|x_s) / µ(a_s|x_s))    Low                   for any π, µ  (safe)       yes  (efficient)

  †Guaranteed convergence of the expected operator R.

Convergence of Retrace(λ)

• For an arbitrary sequence of behaviour policies $\mu_k$ and an increasingly greedy sequence of estimation policies $\pi_k$, if the traces are Markov,
  $c_s = c(a_s, x_s) \in \Big[ 0, \dfrac{\pi(a_s|x_s)}{\mu(a_s|x_s)} \Big]$,
  and the initialization is pessimistic, $T^{\pi_0} Q_0 \ge Q_0$, then
  $\| \mathcal{R}_k Q_k - Q^* \| \le \gamma \| Q_k - Q^* \| + \varepsilon_k \| Q_k \|$.
• Moreover, if the estimation policies become asymptotically greedy ($\varepsilon_k \to 0$), then $Q_k \to Q^*$.

Convergence Proof via the Fixed-Point Theorem

• Show that the fixed point $Q^*$ exists.
• Show that $\mathcal{R}$ is a contraction mapping, $\| \mathcal{R} Q - Q^* \| \le \gamma \| Q - Q^* \|$, for any trace with $0 \le c_s \le \dfrac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$ (equivalently $0 \le c\mu \le \pi$).

(Figure: repeated application of $\mathcal{R}$ pulls any $Q$ toward the fixed point $Q^*$.)

Forward View = Backward View

• Off-Policy Return Operator (forward view)
  $\mathcal{R}_k Q_k(x, a) = Q_k(x, a) + \mathbb{E}_{\mu_k} \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{s=1}^{t} c_s \Big) \big( r_t + \gamma \, \mathbb{E}_{\pi_k} Q_k(x_{t+1}, \cdot) - Q_k(x_t, a_t) \big) \Big]$
  $Q_{k+1}(x, a) = (1 - \alpha_k) Q_k(x, a) + \alpha_k \mathcal{R}_k Q_k(x, a)$
• Algorithm (backward view)
  $Q_{k+1}(x, a) \leftarrow Q_k(x, a) + \alpha_k \sum_{t \ge s} \delta_t^{\pi_k} \sum_{j=s}^{t} \gamma^{t-j} \Big( \prod_{i=j+1}^{t} c_i \Big) \mathbb{1}\{x_j, a_j = x, a\}$,
  where $\delta_t^{\pi_k} = r_t + \gamma \, \mathbb{E}_{\pi_k} Q_k(x_{t+1}, \cdot) - Q_k(x_t, a_t)$
• The two updates coincide (forward view = backward view); a tabular sketch of the backward view follows below.

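A tabular sketch of one backward-view sweep over a sampled trajectory (illustrative; it keeps an accumulating eligibility trace, uses the Retrace coefficient, and applies the accumulated update once at the end of the sweep; this is not the authors' code):

```python
import numpy as np

def backward_view_sweep(Q, traj, pi, gamma, lam, alpha):
    """One offline backward-view sweep with Retrace coefficients (tabular sketch).

    traj : list of (x_t, a_t, r_t, mu_t); Q and pi are arrays indexed [x, a].
    The trace e[x, a] equals sum_j gamma^{t-j} (prod_{i=j+1}^t c_i) 1{x_j, a_j = x, a}.
    """
    e = np.zeros_like(Q)        # eligibility traces
    dQ = np.zeros_like(Q)       # accumulated update, applied once at the end
    for t, (x, a, r, mu_t) in enumerate(traj):
        if t > 0:
            c_t = lam * min(1.0, pi[x, a] / mu_t)
            e *= gamma * c_t                     # decay every existing trace
        e[x, a] += 1.0                           # add 1{x_t, a_t = x, a}
        if t + 1 < len(traj):
            x_next = traj[t + 1][0]
            exp_q_next = np.dot(pi[x_next], Q[x_next])
        else:
            exp_q_next = 0.0                     # terminal-state assumption
        delta = r + gamma * exp_q_next - Q[x, a]
        dQ += alpha * delta * e
    return Q + dQ
```
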
Experiment :: Atari 2600 + DQN

• Q approximated by a CNN [Mnih+ 15]
• Training with 16 parallel threads [Mnih+ 16]:
  1. θ' ← θ
  2. each thread computes a gradient starting from θ'
  3. θ is updated using all of the gradients
• Experience Replay
  – the number of samples used per update is kept equal across methods (64)
  – for Retrace, TB and Q*: 4 sequences of 16 consecutive transitions

• DQN (one-step) update:
  $\Delta Q(x_t, a_t; \theta) = r(x_t, a_t) + \gamma \max_{a'} Q(x_{t+1}, a'; \theta^-) - Q(x_t, a_t; \theta)$
  $\theta \leftarrow \theta + \alpha_t \, \Delta Q(x_t, a_t; \theta) \, \dfrac{\partial Q(x_t, a_t; \theta)}{\partial \theta}$
• Return-based update (Retrace, TB):
  $\Delta Q(x_t, a_t; \theta) = \sum_{s=t}^{k-1} \gamma^{s-t} \Big( \prod_{i=t+1}^{s} c_i \Big) \big[ r(x_s, a_s) + \gamma \, \mathbb{E}_\pi Q(x_{s+1}, \cdot; \theta^-) - Q(x_s, a_s; \theta) \big]$

(The slide also embeds the CNN schematic and learning-curve figures from the Nature DQN paper [Mnih+ 15]. A sketch of the return-based error follows below.)

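A sketch of the return-based error $\Delta Q(x_t, a_t; \theta)$ computed on one length-$k$ segment (pure NumPy; `q`, `q_target`, and `pi_probs` stand in for the online network, the target network, and the estimation policy, and are assumptions of this sketch rather than the paper's implementation):

```python
import numpy as np

def return_based_deltas(q, q_target, pi_probs, xs, acts, rews, mus, gamma, lam):
    """Delta Q(x_t, a_t; theta) for t = 0 .. k-1 on a segment of k transitions.

    q, q_target : callables x -> vector of action values (theta / theta^-)
    pi_probs    : callable x -> probability vector of the estimation policy
    xs          : k+1 states; acts/rews/mus : k actions, rewards, mu(a_s|x_s)
    """
    k = len(acts)
    # per-step errors: r_s + gamma * E_pi Q(x_{s+1}, .; theta^-) - Q(x_s, a_s; theta)
    td = np.array([rews[s]
                   + gamma * np.dot(pi_probs(xs[s + 1]), q_target(xs[s + 1]))
                   - q(xs[s])[acts[s]]
                   for s in range(k)])
    # Retrace coefficients c_s = lam * min(1, pi(a_s|x_s) / mu(a_s|x_s))
    cs = np.array([lam * min(1.0, pi_probs(xs[s])[acts[s]] / mus[s])
                   for s in range(k)])
    deltas = np.zeros(k)
    for t in range(k):
        c_prod = 1.0
        for s in range(t, k):
            if s > t:
                c_prod *= cs[s]              # prod_{i=t+1}^{s} c_i
            deltas[t] += gamma ** (s - t) * c_prod * td[s]
    return deltas                            # each scales dQ(x_t, a_t)/dtheta
```
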
Experimental Results

• Performance compared on 60 Atari games
• $z_{g,a} \in [0, 1]$: the inter-algorithm score of algorithm $a$ on game $g$
• The curves plot $f(x) \triangleq |\{g : z_{g,a} \ge x\}| / 60$, the fraction of games on which algorithm $a$ reaches inter-algorithm score at least $x$ (a short sketch of this computation follows below)

(Figure: fraction of games versus inter-algorithm score, after 40M training frames (Q*, Retrace, Tree-backup, Q-Learning) and after 200M training frames (Retrace, Tree-backup, Q-Learning).)

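The plotted curves are simply empirical survival functions of the per-game scores; a two-line sketch with made-up scores:

```python
import numpy as np

z = np.random.rand(60, 4)          # z[g, a]: made-up inter-algorithm scores
xs = np.linspace(0.0, 1.0, 101)
f = {a: np.array([(z[:, a] >= x).mean() for x in xs]) for a in range(4)}
# f[a][i] = |{g : z[g, a] >= xs[i]}| / 60, one curve per algorithm
```
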
Summary

• Gave a unified view of the existing return-based off-policy methods.
• Proposed a sample-efficient return-based off-policy method for arbitrary behaviour policies.
  – the variance is kept low by truncating the trace coefficients at 1
• On Atari 2600, it achieved higher learning performance than the existing return-based off-policy methods.

Future Work

• Because $c$ is always at most 1, the traces still decay quickly.
• To use the largest possible trace while keeping the variance small, one could take, for example,
  $c_s = \min\Big( \dfrac{1}{c_1 \cdots c_{s-1}}, \dfrac{\pi(a_s|x_s)}{\mu(a_s|x_s)} \Big)$.
• However, $c$ then becomes non-Markov, so the convergence guarantee breaks down.