Recent rl

Recent trends in reinforcement learning research

Published in: Technology

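  1. Q-learning: the Bellman optimality equation and the squared TD loss for an approximator $Q_\theta$:
     $$Q^{*}(s,a) = r(s,a) + \gamma \max_{a'} Q^{*}(s',a')$$
     $$L = \left( r(s,a) + \gamma \max_{a'} Q_{\theta}(s',a') - Q_{\theta}(s,a) \right)^{2}$$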
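  2. Policy gradient via the log-derivative trick:
     $$\begin{aligned}
     \nabla_\theta J &= \nabla_\theta \, \mathbb{E}_{\pi_\theta}\Big[\sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau}\Big] \\
     &= \nabla_\theta \sum_{s'} \sum_{a} P(s' \mid s_t, a)\, \pi_\theta(a \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \\
     &= \sum_{s'} \sum_{a} P(s' \mid s_t, a)\, \nabla_\theta \pi_\theta(a \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \\
     &= \sum_{s'} \sum_{a} P(s' \mid s_t, a)\, \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)} \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \\
     &= \sum_{s'} \sum_{a} P(s' \mid s_t, a)\, \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau} \\
     &= \mathbb{E}_{\pi_\theta}\Big[\nabla_\theta \log \pi_\theta(a \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau}\Big]
     \end{aligned}$$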
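  3. Monte Carlo estimate of this expectation from $M$ sampled trajectories $T$:
     $$\mathbb{E}_{\pi_\theta}\Big[\nabla_\theta \log \pi_\theta(a \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau}\Big] \approx \frac{1}{M} \sum_{T} \sum_{i} \nabla_\theta \log \pi_\theta(a_i^T \mid s_i^T) \Big(\sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau}^T\Big), \qquad T = (s_0^T, a_0^T, r_0^T, \ldots, s_n^T, a_n^T, r_n^T)$$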
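  4. The resulting estimator:
     $$\frac{1}{M} \sum_{T} \sum_{i} \nabla_\theta \log \pi_\theta(a_i^T \mid s_i^T) \Big(\sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau}^T\Big)$$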
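  5. Replacing the trajectory return with an advantage function $A$:
     $$\frac{1}{M} \sum_{T} \sum_{i} \nabla_\theta \log \pi_\theta(a_i^T \mid s_i^T) \Big(\sum_{\tau=0}^{\infty} \gamma^{\tau} R_{\tau}^T\Big) \;\longrightarrow\; \frac{1}{M} \sum_{T} \sum_{i} \nabla_\theta \log \pi_\theta(a_i^T \mid s_i^T)\, A(s_i^T, a_i^T)$$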
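  6. Auxiliary Q-function $Q^{\text{aux}}(a, i, j)$ and its $n$-step Q-learning loss, with target-network parameters $\theta^-$:
     $$L_Q = \mathbb{E}\Big[\big(R_{t:t+n} + \gamma^{n} \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]$$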
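  7. $n$-step value loss $L_{VR}$, again bootstrapping from target parameters $\theta^-$:
     $$L_{VR} = \mathbb{E}_{\pi}\Big[\big(R_{t:t+n} + \gamma^{n} V(s_{t+n+1}, \theta^-) - V(s_t, \theta)\big)^2\Big]$$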
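  8. Importance sampling: rewriting an expectation under $q$ as an expectation under $p$ (assuming $p(x) > 0$ wherever $q(x) > 0$):
     $$\mathbb{E}_p[f(x)] = \sum_x p(x) f(x), \qquad \mathbb{E}_q[f(x)] = \sum_x q(x) f(x) = \sum_x \frac{q(x)}{p(x)}\, p(x) f(x) = \sum_x p(x) \frac{q(x)}{p(x)} f(x) = \mathbb{E}_p\Big[\frac{q(x)}{p(x)} f(x)\Big]$$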
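  9. A3C loss with an entropy bonus weighted by $\alpha$:
     $$L_{A3C} = L_\pi + L_V - \mathbb{E}_{s \sim \pi}\big[\alpha H(\pi(\cdot \mid s))\big]$$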
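  10. Q-value estimate induced by an entropy-regularized policy $\pi$:
      $$\tilde{Q}^{\pi}(s,a) = \alpha\big(\log \pi(s,a) + H^{\pi}(s)\big) + V^{\pi}(s)$$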
  11. Soft Bellman optimality equation with temperature $\tau$:
      $$Q^{*}(s,a) = r(s,a) + \gamma \tau \log \sum_{a'} \exp\big(Q^{*}(s',a') / \tau\big)$$
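  12. Softmax path consistency in single-step and multi-step form, with the discounted sums $R$ and $G$:
      $$V^{*}(s) = -\tau \log \pi^{*}(a \mid s) + r(s,a) + \gamma V^{*}(s')$$
      $$-V^{*}(s_1) + \gamma^{t-1} V^{*}(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi^{*}) = 0$$
      $$R(s_{m:n}) = \sum_{i=0}^{n-m-1} \gamma^{i} r(s_{m+i}, a_{m+i}), \qquad G(s_{m:n}, \pi) = \sum_{i=0}^{n-m-1} \gamma^{i} \log \pi(a_{m+i} \mid s_{m+i})$$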
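  13. Path consistency error $C_{\theta,\phi}$ and the resulting updates for the policy parameters $\theta$ and value parameters $\phi$:
      $$C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1} V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi_\theta)$$
      $$\Delta\theta \propto C_{\theta,\phi}(s_{1:t})\, \nabla_\theta G(s_{1:t}, \pi_\theta), \qquad \Delta\phi \propto C_{\theta,\phi}(s_{1:t}) \big(\nabla_\phi V_\phi(s_1) - \gamma^{t-1} \nabla_\phi V_\phi(s_t)\big)$$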
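
The Q-learning loss in item 1 and the n-step targets in items 6 and 7 can be illustrated with a minimal tabular sketch. It is not taken from the slides. The dictionary Q-table, the function names `q_learning_update` and `n_step_target`, and the step size `alpha` are illustrative assumptions. The sketch also uses no separate target network θ⁻, so the same table supplies the bootstrap value.

```python
from collections import defaultdict

# Hypothetical helper names: a tabular sketch of the losses above,
# not an implementation from the slides or from any library.


def n_step_target(rewards, bootstrap_value, gamma):
    """R_{t:t+n} + gamma^n * bootstrap_value, where
    R_{t:t+n} = sum_{k=0}^{n-1} gamma^k * rewards[k]."""
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return ret + (gamma ** len(rewards)) * bootstrap_value


def q_learning_update(Q, s, a, r, s_next, actions, gamma=0.99, alpha=0.1,
                      terminal=False):
    """One tabular Q-learning step toward r + gamma * max_a' Q(s', a').

    This is a semi-gradient step on L = (target - Q(s, a))^2 with the
    target held fixed (the constant factor 2 is folded into alpha).
    Returns the squared TD error measured before the update.
    """
    bootstrap = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    target = n_step_target([r], bootstrap, gamma)  # the n = 1 case
    delta = target - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta ** 2


if __name__ == "__main__":
    Q = defaultdict(float)  # unseen (state, action) pairs start at 0
    Q[("s1", "right")] = 2.0
    loss = q_learning_update(Q, "s0", "left", 1.0, "s1", ["left", "right"],
                             gamma=0.5, alpha=0.5)
    print(loss, Q[("s0", "left")])  # prints: 4.0 1.0
```

Passing more than one reward to `n_step_target` gives the $R_{t:t+n} + \gamma^n(\cdot)$ target used in items 6 and 7. In a function-approximation setting, the bootstrap value would come from the frozen parameters θ⁻ rather than the table being updated.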

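The Monte Carlo policy-gradient estimator in items 3 to 5 can be sketched for a tabular softmax policy in pure Python. The sketch rests on assumptions that are not in the slides: each state has its own vector of action logits, trajectories are lists of `(s, a, r)` tuples, and all function names are hypothetical. For a softmax policy, the gradient of $\log \pi(a \mid s)$ with respect to the logits of $s$ is $\text{onehot}(a) - \pi(\cdot \mid s)$.

```python
import math

# Hypothetical names: a tabular-softmax sketch of the Monte Carlo
# policy-gradient estimator above, not the slides' own code.


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def discounted_return(rewards, gamma):
    """sum_tau gamma^tau * R_tau over one whole trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def grad_log_softmax(logits, a):
    """Gradient of log softmax(logits)[a] w.r.t. the logits: onehot(a) - pi."""
    probs = softmax(logits)
    return [(1.0 if b == a else 0.0) - p for b, p in enumerate(probs)]


def policy_gradient(theta, trajectories, gamma, advantage=None):
    """(1/M) sum_T sum_i grad log pi(a_i^T | s_i^T) * weight_i^T.

    theta: dict mapping state -> list of action logits.
    trajectories: M trajectories, each a list of (s, a, r) tuples.
    advantage: optional callable (s, a) -> A(s, a). If None, every step is
    weighted by its trajectory's full discounted return.
    """
    grad = {s: [0.0] * len(logits) for s, logits in theta.items()}
    M = len(trajectories)
    for traj in trajectories:
        ret = discounted_return([r for _, _, r in traj], gamma)
        for s, a, _ in traj:
            weight = ret if advantage is None else advantage(s, a)
            for b, g in enumerate(grad_log_softmax(theta[s], a)):
                grad[s][b] += weight * g / M
    return grad


if __name__ == "__main__":
    theta = {"s0": [0.0, 0.0]}
    trajs = [[("s0", 0, 1.0)], [("s0", 1, 0.0)]]
    print(policy_gradient(theta, trajs, gamma=0.9))
```

The result is an ascent direction for $J$. Weighting every step by the whole trajectory's return follows item 3 as written. Many implementations instead weight step $i$ by the return from $i$ onward, or by an advantage as in item 5, to reduce variance.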
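
The soft Bellman backup in item 11, together with the path consistency error and value update in items 12 and 13, can be sketched as follows. The sketch assumes a single sampled path whose per-step rewards and $\log \pi(a \mid s)$ values are already known. The function names are hypothetical, and the value function is a plain table rather than a parameterized $V_\phi$. The policy update $\Delta\theta \propto C \nabla_\theta G$ is omitted because it depends on how $\pi_\theta$ is parameterized.

```python
import math

# Hypothetical names: a sketch of the soft Bellman backup and the
# path consistency error above, assuming a single sampled path.


def soft_backup(r, q_next, gamma, tau):
    """r + gamma * tau * log sum_a' exp(Q(s', a') / tau), computed stably."""
    m = max(q_next)
    lse = m / tau + math.log(sum(math.exp((q - m) / tau) for q in q_next))
    return r + gamma * tau * lse


def discounted_sum(values, gamma):
    """sum_{i=0}^{k-1} gamma^i * values[i] (the R and G sums above)."""
    return sum((gamma ** i) * v for i, v in enumerate(values))


def path_consistency_error(v_first, v_last, rewards, log_probs, gamma, tau):
    """C = -V(s_1) + gamma^(t-1) V(s_t) + R(s_{1:t}) - tau * G(s_{1:t}, pi).

    rewards[i] and log_probs[i] belong to step (s_{1+i}, a_{1+i}), so a
    path of t states contributes t - 1 rewards and log-probabilities.
    """
    if len(rewards) != len(log_probs):
        raise ValueError("need one log-probability per reward")
    k = len(rewards)  # k = t - 1
    R = discounted_sum(rewards, gamma)
    G = discounted_sum(log_probs, gamma)
    return -v_first + (gamma ** k) * v_last + R - tau * G


def tabular_value_update(V, s_first, s_last, k, C, gamma, lr):
    """Tabular case of delta_phi propto C (grad V(s_1) - gamma^k grad V(s_t)),
    i.e. one gradient-descent step on C^2 / 2 w.r.t. the value table."""
    V[s_first] += lr * C
    V[s_last] -= lr * C * gamma ** k


if __name__ == "__main__":
    # Values and rewards chosen so that every single-step consistency
    # tau * log pi = r + gamma * V(s') - V(s) holds; C should be close to 0.
    gamma, tau = 0.9, 0.5
    V = [2.0, 1.0, 0.5]
    rewards = [0.1, 0.2]
    log_probs = [(r + gamma * V[i + 1] - V[i]) / tau for i, r in enumerate(rewards)]
    print(path_consistency_error(V[0], V[-1], rewards, log_probs, gamma, tau))
```

The example shows why the multi-step condition in item 12 follows from the single-step one: the $\gamma^i V$ terms telescope. If each step satisfies $\tau \log \pi = r + \gamma V(s') - V(s)$, the error $C$ over the whole path is zero up to rounding.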