Learning Continuous Control Policies by Stochastic Value Gradients

NIPS2015 Reading Group
Yasuhiro Fujita
Preferred Networks Inc.
January 20, 2016

About the Speaker
▶ Yasuhiro Fujita
▶ Preferred Networks Inc.
▶ Twitter: @mooopan
▶ GitHub: muupan
▶ Interested in reinforcement learning and AI
▶ Recent work (related to a presentation)
(https://twitter.com/hillbig/status/684813252484698112)

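Paper Being Read
▶ Learning Continuous Control Policies by Stochastic Value Gradients
▶ Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, Tom Erez (Google DeepMind)
▶ Proposes reinforcement learning methods for stochastic control problems in which both states and actions take continuous values
▶ The model, the value function, and the policy are represented by neural networks (NNs)
▶ Uses the reparameterization trick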

Video
▶ https://www.youtube.com/watch?v=PYdL7bcn_cM

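Problem Setting
▶ Markov Decision Process
  ▶ State $s^t \in \mathbb{R}^{N_S}$
  ▶ Action $a^t \in \mathbb{R}^{N_A}$
  ▶ Initial state distribution $s^0 \sim p^0(\cdot)$
  ▶ Transition distribution $s^{t+1} \sim p(\cdot \mid s^t, a^t)$
    ▶ $s^{t+1}$ is determined stochastically from $s^t$ and $a^t$
  ▶ Reward function $r^t = r(s^t, a^t, t)$ (time-dependent)
▶ What we want to find
  ▶ A (stochastic) policy $a^t \sim p(\cdot \mid s^t; \theta)$
    ▶ $a^t$ is determined stochastically from $s^t$
▶ What we want to maximize
  ▶ The expected sum of rewards $J(\theta) = E\left[\sum_{t=0}^{T} \gamma^t r^t \mid \theta\right]$
  ▶ $\gamma \in [0, 1]$ is the discount factor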
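To make the objective concrete, the discounted sum inside the expectation can be computed for a sampled trajectory and averaged over several trajectories. This is only a minimal sketch: the function names `discounted_return` and `estimate_objective` and the reward lists are assumptions for illustration and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one trajectory (t = 0..T)."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(trajectories, gamma):
    """Monte Carlo estimate of J: the average discounted return."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)
```

For example, `estimate_objective([[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]], 0.5)` averages the discounted returns of two hypothetical reward sequences.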

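Notes on Notation
▶ Subscripts denote partial derivatives
  ▶ $\pi_\theta = \frac{\partial \pi}{\partial \theta}$
  ▶ It does not mean "$\pi$ parameterized by $\theta$"
  ▶ (although there is one place where $\pi_\theta$ is used in the latter sense ...)
▶ A superscript is either a time index or an exponent
  ▶ Expected sum of rewards (repeated): $J(\theta) = E\left[\sum_{t=0}^{T} \gamma^t r^t \mid \theta\right]$
  ▶ $r^t$ is the reward at time $t$
  ▶ $\gamma^t$ is $\gamma$ to the $t$-th power
  ▶ Which one is meant has to be judged from context ...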

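When Actions Are Continuous
▶ "Isn't DQN good enough?"
  ▶ DQN [Mnih et al. 2013; Mnih et al. 2015] learns the state-action value $Q(s, a; \theta)$ and selects the action $\arg\max_a Q(s, a; \theta)$
  ▶ When $a$ is continuous, the $\arg\max$ cannot be computed easily!
▶ Represent the policy directly (with an NN)
  ▶ $a^t \sim p(\cdot \mid s^t; \theta)$
  ▶ Actions are selected by sampling from this distribution
▶ There are various ways to update $\theta$; this paper deals with the family called policy gradient methods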

Policy Gradient Methods
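▶ Goal: find the policy parameters $\theta$ that maximize $J(\theta) = E\left[\sum_{t=0}^{T} \gamma^t r^t \mid \theta\right]$
▶ Compute $\nabla_\theta J(\theta)$ (the policy gradient)
  ▶ Once it is available, the policy can be optimized by gradient ascent (policy gradient methods)
▶ How can it be computed?
  ▶ likelihood ratio methods
  ▶ value gradient methods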

Likelihood Ratio Methods (2)
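▶ Estimate $\nabla_\theta J(\theta)$ using the following identity [Williams 1992; Sutton et al. 1999]
$$\nabla_\theta J(\theta) = E_{s \sim \rho^\pi,\, a \sim p(\cdot \mid s; \theta)}\left[Q(s, a)\, \nabla_\theta \log p(a \mid s; \theta)\right]$$
▶ Widely used as a way to compute the policy gradient
▶ Drawbacks
  ▶ Does not use the gradient information of $Q(s, a)$
  ▶ High variance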
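The likelihood ratio estimator can be illustrated on a stateless one-step toy problem. The sketch below assumes a Gaussian policy a ~ N(theta, sigma^2) and a made-up return `q_value(a) = -(a - 2)^2`; both are assumptions chosen so that the true gradient is known in closed form, and none of the names come from the paper.

```python
import random


def q_value(a):
    # Hypothetical one-step return, peaked at a = 2.
    return -(a - 2.0) ** 2


def likelihood_ratio_samples(theta, sigma, n, seed=0):
    """Per-sample estimates Q(a) * d/dtheta log N(a; theta, sigma^2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        a = rng.gauss(theta, sigma)          # a ~ p(.|theta)
        score = (a - theta) / sigma ** 2     # d/dtheta log p(a|theta)
        samples.append(q_value(a) * score)   # uses Q's value, never its gradient
    return samples


def mean(xs):
    return sum(xs) / len(xs)


grad_estimate = mean(likelihood_ratio_samples(theta=0.0, sigma=1.0, n=200_000))
# For this toy problem J(theta) = -((theta - 2)^2 + sigma^2),
# so the analytic gradient at theta = 0 is -2 * (0 - 2) = 4.
```

Only evaluations of `q_value` enter the estimate, which is the first drawback listed above; the large sample count is needed because individual samples are noisy.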

Deterministic Value Gradients (1)
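▶ Compute the gradient of the value function (the value gradient) by backpropagation (value gradient methods)
▶ Since $J(\theta) = E_{s^0 \sim p^0} V^0(s^0)$, it suffices to compute $V^0_\theta$
▶ When the MDP and the policy are deterministic ($s' = f(s, a)$, $a = \pi(s)$), differentiating the deterministic Bellman equation
$$V(s) = r(s, \pi(s)) + \gamma V'(f(s, \pi(s)))$$
gives the value gradient
$$V_s = r_s + r_a \pi_s + \gamma V'_{s'} (f_s + f_a \pi_s) \qquad (3)$$
$$V_\theta = r_a \pi_\theta + \gamma V'_{s'} f_a \pi_\theta + \gamma V'_\theta \qquad (4)$$
$$\phantom{V_\theta} = Q_a \pi_\theta + \gamma V'_\theta$$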

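▶ Applying equations (3) and (4) backward along the trajectory $(s^0, a^0, s^1, a^1, \dots)$ computes $V^0_\theta(s^0)$ in the same way as backpropagation in an RNN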
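The backward recursion of equations (3) and (4) can be sketched on a scalar toy system. Everything concrete here is an assumption for illustration: linear dynamics s' = A*s + B*a, a linear policy a = theta*s, the reward r = -(s^2 + a^2), and a finite horizon after which the value is zero.

```python
A, B = 0.9, 0.5  # hypothetical linear dynamics: s' = A*s + B*a


def rollout(theta, s0, horizon):
    """Run the deterministic policy a = theta*s for t = 0..horizon."""
    states, actions = [], []
    s = s0
    for _ in range(horizon + 1):
        a = theta * s
        states.append(s)
        actions.append(a)
        s = A * s + B * a
    return states, actions


def total_return(theta, s0, horizon, gamma):
    """Discounted return of the rollout, i.e. V^0(s^0) for this theta."""
    states, actions = rollout(theta, s0, horizon)
    return sum(gamma ** t * -(s * s + a * a)
               for t, (s, a) in enumerate(zip(states, actions)))


def value_gradient(theta, s0, horizon, gamma):
    """Backward recursion of equations (3) and (4); returns V_theta at t = 0."""
    states, actions = rollout(theta, s0, horizon)
    v_s_next, v_theta_next = 0.0, 0.0      # no value beyond the horizon
    for s, a in zip(reversed(states), reversed(actions)):
        r_s, r_a = -2.0 * s, -2.0 * a      # partials of r = -(s^2 + a^2)
        pi_s, pi_theta = theta, s          # partials of pi = theta * s
        f_s, f_a = A, B                    # partials of f = A*s + B*a
        # Equation (3)
        v_s = r_s + r_a * pi_s + gamma * v_s_next * (f_s + f_a * pi_s)
        # Equation (4)
        v_theta = (r_a * pi_theta + gamma * v_s_next * f_a * pi_theta
                   + gamma * v_theta_next)
        v_s_next, v_theta_next = v_s, v_theta
    return v_theta_next
```

One way to check such a recursion is to compare `value_gradient` with a central finite difference of `total_return` with respect to `theta`; the two should agree up to numerical error.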

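▶ Drawbacks
  ▶ Cannot handle stochastic MDPs or stochastic policies
  ▶ When different states cannot be told apart (state aliasing), a stochastic policy is needed
    ▶ Example: if the gray states are indistinguishable, a deterministic policy may never reach the gold, depending on the starting position
▶ The reparameterization trick solves this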

Reparameterization Trick
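▶ Another way to compute $\nabla_x E_{p(y \mid x)}\, g(y)$
▶ Write $p(y \mid x)$ using a deterministic function $f$ and a noise variable $\xi$: $y = f(x, \xi)$, $\xi \sim \rho(\cdot)$
  ▶ Example: for $p(y \mid x) = \mathcal{N}(\mu(x), \sigma^2(x))$, $y = \mu(x) + \sigma(x)\xi$ with $\xi \sim \mathcal{N}(0, 1)$
▶ Differentiating gives
$$\nabla_x E_{p(y \mid x)}\, g(y) = E_{\rho(\xi)}\, g_y f_x \approx \frac{1}{M} \sum_{i=1}^{M} g_y f_x \Big|_{\xi = \xi_i} \qquad (5)$$
▶ Unlike likelihood ratio methods, this uses the gradient information of $g$ and has lower variance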
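The variance claim can be checked numerically on a toy case. The sketch below assumes y ~ N(x, sigma^2) and a made-up function g(y) = -(y - 2)^2 (neither is from the paper) and draws per-sample gradient estimates from both the reparameterized form of equation (5) and the likelihood ratio form, using the same noise samples.

```python
import random


def g(y):
    # Hypothetical objective inside the expectation.
    return -(y - 2.0) ** 2


def g_y(y):
    # Derivative of g with respect to y.
    return -2.0 * (y - 2.0)


def gradient_samples(x, sigma, n, seed=0):
    """Per-sample estimates of d/dx E[g(y)] for y ~ N(x, sigma^2)."""
    rng = random.Random(seed)
    reparam, likelihood_ratio = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)           # xi ~ rho = N(0, 1)
        y = x + sigma * xi                 # y = f(x, xi), so f_x = 1
        reparam.append(g_y(y) * 1.0)       # g_y * f_x, as in equation (5)
        likelihood_ratio.append(g(y) * (y - x) / sigma ** 2)
    return reparam, likelihood_ratio


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)


rp, lr = gradient_samples(x=0.0, sigma=1.0, n=100_000)
print("reparameterization:", mean(rp), variance(rp))
print("likelihood ratio:  ", mean(lr), variance(lr))
```

Both estimators target the same gradient (4 at x = 0 for this g), but the per-sample variance differs greatly: working it out by hand for this toy setup gives 4 for the reparameterized samples and 87 for the likelihood ratio samples.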

Stochastic Value Gradients
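▶ Reparameterize the transition distribution as $s' = f(s, a, \xi)$ and the policy as $a = \pi(s, \eta; \theta)$
$$V_s = E_{\rho(\eta)}\left[r_s + r_a \pi_s + \gamma E_{\rho(\xi)} V'_{s'} (f_s + f_a \pi_s)\right] \qquad (7)$$
$$V_\theta = E_{\rho(\eta)}\left[r_a \pi_\theta + \gamma E_{\rho(\xi)}\left[V'_{s'} f_a \pi_\theta + V'_\theta\right]\right] \qquad (8)$$
$$\phantom{V_\theta} = E_{\rho(\eta)}\left[Q_a \pi_\theta + \gamma E_{\rho(\xi)} V'_\theta\right]$$
▶ The value gradient can be computed even when both the MDP and the policy are stochastic! (stochastic value gradient)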
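A single-sample version of equations (7) and (8) can be sketched by fixing one draw of the noise and running the same backward recursion along the resulting trajectory. The linear-Gaussian system, the reward r = -(s^2 + a^2), the noise scales, and all function names below are assumptions for illustration; the known dynamics stand in for the learned model of the paper.

```python
import random

A, B = 0.9, 0.5                # hypothetical dynamics: s' = A*s + B*a + SIGMA_F*xi
SIGMA_F, SIGMA_PI = 0.1, 0.2   # hypothetical noise scales


def rollout(theta, s0, etas, xis):
    """Simulate a = theta*s + SIGMA_PI*eta and s' = A*s + B*a + SIGMA_F*xi."""
    states, actions = [], []
    s = s0
    for eta, xi in zip(etas, xis):
        a = theta * s + SIGMA_PI * eta
        states.append(s)
        actions.append(a)
        s = A * s + B * a + SIGMA_F * xi
    return states, actions


def total_return(theta, s0, etas, xis, gamma):
    states, actions = rollout(theta, s0, etas, xis)
    return sum(gamma ** t * -(s * s + a * a)
               for t, (s, a) in enumerate(zip(states, actions)))


def svg_sample(theta, s0, etas, xis, gamma):
    """One-sample estimate of V_theta at t = 0 for fixed noise (eta, xi)."""
    states, actions = rollout(theta, s0, etas, xis)
    v_s_next, v_theta_next = 0.0, 0.0      # no value beyond the horizon
    for s, a in zip(reversed(states), reversed(actions)):
        r_s, r_a = -2.0 * s, -2.0 * a
        pi_s, pi_theta = theta, s          # additive noise leaves the partials unchanged
        f_s, f_a = A, B
        v_s = r_s + r_a * pi_s + gamma * v_s_next * (f_s + f_a * pi_s)
        v_theta = r_a * pi_theta + gamma * (v_s_next * f_a * pi_theta + v_theta_next)
        v_s_next, v_theta_next = v_s, v_theta
    return v_theta_next


def svg_estimate(theta, s0, horizon, gamma, n, seed=0):
    """Average svg_sample over n independent noise draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        etas = [rng.gauss(0.0, 1.0) for _ in range(horizon + 1)]
        xis = [rng.gauss(0.0, 1.0) for _ in range(horizon + 1)]
        total += svg_sample(theta, s0, etas, xis, gamma)
    return total / n
```

With the noise held fixed, the return is a deterministic function of `theta`, so `svg_sample` should match a finite difference of `total_return` computed with the same `etas` and `xis`.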

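Recovering the Noise
▶ The transition function $f$ is actually unknown and has to be learned, but gradients are computed with the actually observed $s'$ rather than the prediction $\hat{s}' = \hat{f}(s, a, \xi)$
  ▶ This limits the effect of prediction errors in $\hat{f}$
▶ Gradients for the current $\theta^t$ are computed using actions $a^k = \pi(s^k, \eta; \theta^k)$ that were selected with old parameters $\theta^k$
  ▶ This makes experience replay (reusing past experience) possible
▶ As a result, the noise has to be recovered:
$$\xi \sim p(\xi \mid s, a, s'), \qquad \eta \sim p(\eta \mid s, a)$$
  ▶ In the Gaussian case it can be obtained as $\eta = (a^k - \mu(s^k)) / \sigma(s^k)$ (confirmed with the authors)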
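For the Gaussian case, recovering the noise is a matter of inverting the reparameterization. The sketch below assumes a policy with mean `theta * s` and fixed standard deviation, and an additive-Gaussian model; the function names, the `linear_model` helper, and all numeric values are hypothetical.

```python
def policy(s, eta, theta, sigma):
    """Reparameterized Gaussian policy: a = mu(s) + sigma * eta, mu(s) = theta * s."""
    return theta * s + sigma * eta


def infer_policy_noise(s, a, theta, sigma):
    """Recover eta such that policy(s, eta, theta, sigma) == a."""
    return (a - theta * s) / sigma


def linear_model(s, a):
    # Hypothetical learned mean of the transition model.
    return 0.9 * s + 0.5 * a


def infer_model_noise(s, a, s_next, model, sigma_f):
    """Recover xi for an additive-Gaussian model s' = model(s, a) + sigma_f * xi."""
    return (s_next - model(s, a)) / sigma_f


# A stored transition whose action was generated with old parameters.
theta_old, theta_now, sigma = 0.5, 0.8, 0.2
s, s_next = 1.0, 1.3
a = policy(s, eta=0.7, theta=theta_old, sigma=sigma)

# Noise values that explain the stored transition under the current parameters.
eta_now = infer_policy_noise(s, a, theta_now, sigma)
xi_now = infer_model_noise(s, a, s_next, linear_model, sigma_f=0.1)
```

Plugging `eta_now` back into the current policy reproduces the stored action, and `xi_now` makes the model output equal the observed `s_next`, so gradients can be taken through the stored transition.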

Three Variants
▶ Three algorithms are proposed that differ in how the value gradient is computed
  ▶ SVG(∞)
  ▶ SVG(1)
  ▶ SVG(0)

SVG(∞)
▶ Learns the transition function $\hat{f}(s, a, \xi)$ together with the policy $\pi(s, \eta)$

SVG(1)
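▶ Learns the transition function $\hat{f}(s, a, \xi)$, the policy $\pi(s, \eta)$, and the value function $\hat{V}(s)$ together
▶ Uses $\hat{f}$ for only one step and $\hat{V}$ for the rest
▶ When experience replay is used, the method is written SVG(1)-ER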

SVG(0)
▶ Learns the policy $\pi(s, \eta)$ together with the action-value function $\hat{Q}(s, a)$
▶ Does not use a transition function

Evaluation
▶ AC [Wawrzynski 2009] and DPG [Silver et al. 2014] are existing methods (they learn a policy and a value function)
▶ SVG(1)-ER performs well overall

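When the Model Gets Worse
▶ Evaluated with the hidden-layer dimensionality of $\hat{f}$ reduced
▶ The performance of SVG(∞) degrades greatly, while SVG(1) hardly changes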

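When the Value Function Gets Worse
▶ Evaluated with the hidden-layer dimensionality of the value function reduced
▶ The performance of DPG degrades greatly, while SVG(1) is little affected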

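Summary
▶ Uses the reparameterization trick instead of likelihood ratio methods
▶ This makes it possible to compute value gradients for stochastic MDPs and stochastic policies (stochastic value gradients)
▶ Among the proposed methods, SVG(1)-ER performed well in the experiments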

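Impressions
▶ The reparameterization trick is handy
  ▶ Worth using wherever it can replace likelihood ratio methods
  ▶ When actions are discrete the reparameterization trick cannot be used, so is there no choice but to rely on likelihood ratio methods?
▶ I am curious how SVG(0)-ER would perform
  ▶ Experience replay seems to be what matters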

References
[1] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), pp. 529–533.
[2] Volodymyr Mnih et al. “Playing Atari with Deep Reinforcement Learning”. In: NIPS Deep Learning Workshop. 2013, pp. 1–9. arXiv:1312.5602v1.
[3] David Silver et al. “Deterministic Policy Gradient Algorithms”. In: ICML 2014. 2014, pp. 387–395.
[4] Richard S. Sutton et al. “Policy Gradient Methods for Reinforcement Learning with Function Approximation”. In: Advances in Neural Information Processing Systems 12. 1999, pp. 1057–1063.
[5] Pawel Wawrzynski. “Real-time reinforcement learning by sequential Actor-Critics and experience replay”. In: Neural Networks 22.10 (2009), pp. 1484–1497.
[6] Ronald J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine Learning 8.3-4 (1992), pp. 229–256.
