論文情報
• タイトル
– Asynchronous Methods for Deep Reinforcement Learning
– URL : https://arxiv.org/abs/1602.01783
• 発表学会
– ICML 2016
• 著者
– Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, et al.
• 所属
– Google DeepMind / Montreal Institute for Learning Algorithms (MILA), University of Montreal
Algorithm 1
[Diagram: Parameter Server θ and worker threads 1..k; each thread holds an Environment, an Actor, a local Network, a Memory, a Loss, and Gradients (a learner with A3C).]
Step 1: each thread copies the weights from the parameter server into its local network.
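As a rough illustration of this step, here is a minimal sketch in PyTorch; the names global_net (the parameter-server copy) and local_net (the per-thread copy) are assumptions for the sketch, not names from the paper.

# Minimal sketch of step 1, assuming PyTorch and two same-architecture models:
# "global_net" (held by the parameter server) and a per-thread "local_net".
import torch.nn as nn

def sync_from_server(local_net: nn.Module, global_net: nn.Module) -> None:
    # Each worker overwrites its local weights with the current
    # parameter-server weights before collecting new experience.
    local_net.load_state_dict(global_net.state_dict())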
Algorithm 2
[Diagram: same architecture as above.]
Step 2: each thread acts in its environment and stores the experience in its memory (until t_max steps or the episode is done).
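A minimal sketch of this rollout-collection step, assuming a classic Gym-style env whose step() returns (obs, reward, done, info) and a local_net that returns (action logits, state value); these names are illustrative.

# Minimal sketch of step 2: collect experience until t_max steps or Done.
import torch

def collect_rollout(env, local_net, state, t_max=5):
    memory = []          # list of (state, action, reward) tuples
    done = False
    for _ in range(t_max):
        logits, _ = local_net(torch.as_tensor(state, dtype=torch.float32))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        next_state, reward, done, _ = env.step(action)
        memory.append((state, action, reward))
        state = next_state
        if done:
            break
    return memory, state, done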
Algorithm 3
[Diagram: same architecture as above.]
Step 3: each thread computes the loss from its memory and derives the gradients.
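A minimal sketch of the loss computation, assuming local_net returns (action logits, state value) and memory holds (state, action, reward) tuples; n-step returns are formed backwards, and the loss combines the policy-gradient term with a value-regression term (the entropy term is omitted here for brevity).

# Minimal sketch of step 3: n-step returns and an actor-critic loss.
import torch
import torch.nn.functional as F

def compute_loss(local_net, memory, bootstrap_value, gamma=0.99):
    # bootstrap_value is 0 if the episode ended, V(s_last) otherwise.
    R = bootstrap_value
    loss = 0.0
    for state, action, reward in reversed(memory):
        R = reward + gamma * R
        logits, value = local_net(torch.as_tensor(state, dtype=torch.float32))
        advantage = R - value
        log_prob = F.log_softmax(logits, dim=-1)[action]
        # policy-gradient term (advantage treated as a constant) + value loss
        loss = loss - log_prob * advantage.detach() + 0.5 * advantage.pow(2)
    return loss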
Algorithm 4
[Diagram: same architecture as above.]
Step 4: each thread asynchronously passes its gradients to the parameter server, which updates the server network; the thread then returns to step 1 and repeats until T_max.
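A minimal sketch of the asynchronous update, assuming torch.multiprocessing workers, a global_net whose tensors live in shared memory (share_memory()), and an optimizer built over global_net.parameters(); the lock-free, Hogwild-style copy of gradients is a common implementation pattern, not the only possible one.

# Minimal sketch of step 4: hand local gradients to the shared (server) model.
def push_gradients_and_update(local_net, global_net, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()                       # gradients accumulate in local_net
    for local_p, global_p in zip(local_net.parameters(),
                                 global_net.parameters()):
        global_p.grad = local_p.grad      # hand the gradients to the server copy
    optimizer.step()                      # lock-free update of the shared weights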
Algorithm 1
[Diagram: Parameter Server θ holding a shared network with a π head and an R head; actors 1..k each with an Environment, an Actor, and an Advantage Memory; a minibatch feeds the update.]
Step 1: each Actor acts according to the output of the shared network. The network usually forks into two heads partway through: the π head's output size equals the action space, while the R (value) head is 1-dimensional (a scalar).
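A minimal sketch of such a two-headed network, assuming flat observations of size obs_dim and a discrete action space of size n_actions; the layer sizes are illustrative.

# Minimal sketch of the shared two-headed (policy + value) network.
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.pi_head = nn.Linear(hidden, n_actions)  # policy logits, one per action
        self.v_head = nn.Linear(hidden, 1)           # scalar state value

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.pi_head(h), self.v_head(h)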
Algorithm 2
[Diagram: same architecture as above.]
Step 2: each actor computes the advantage and stores it in its memory (until T steps or the episode ends).
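A minimal sketch of the advantage computation, assuming a finished rollout of (state, action, reward, value) entries and advantages defined as the discounted return minus V(s); the exact bookkeeping in the slides' implementation may differ.

# Minimal sketch of step 2: fill the Advantage Memory from one rollout.
def compute_advantages(rollout, bootstrap_value=0.0, gamma=0.99):
    advantage_memory = []
    R = bootstrap_value                   # 0 if the episode terminated
    for state, action, reward, value in reversed(rollout):
        R = reward + gamma * R
        advantage_memory.append((state, action, R, R - value))
    advantage_memory.reverse()
    return advantage_memory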
Algorithm 3
[Diagram: same architecture as above.]
Step 3: once every Actor has finished one episode, a minibatch is built from the Advantage Memory and the shared network is updated with the resulting gradients.
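A minimal sketch of this minibatch update, assuming the pooled Advantage Memory holds (state, action, return, advantage) tuples and the two-headed network sketched earlier; the value and entropy coefficients below are common illustrative defaults, not values taken from the paper (the paper weights the entropy regularization term with a hyperparameter β).

# Minimal sketch of step 3: one gradient update from the pooled minibatch.
import torch
import torch.nn.functional as F

def update_from_minibatch(net, optimizer, advantage_memory,
                          value_coef=0.5, entropy_coef=0.01):
    states = torch.tensor([m[0] for m in advantage_memory], dtype=torch.float32)
    actions = torch.tensor([m[1] for m in advantage_memory])
    returns = torch.tensor([m[2] for m in advantage_memory], dtype=torch.float32)
    advantages = torch.tensor([m[3] for m in advantage_memory], dtype=torch.float32)

    logits, values = net(states)
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    policy_loss = -(chosen * advantages).mean()
    value_loss = F.mse_loss(values.squeeze(-1), returns)
    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()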