
Q prop

  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
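  • 6. $S_{t+1} \sim P(s' \mid S_t, A_t)$, $r_{t+1} = r(S_t, A_t, S_{t+1})$, $A_t \sim \pi(a' \mid S_t)$
  • 7. $S_{t+1} \sim P(s' \mid S_t, A_t)$, $r_{t+1} = r(S_t, A_t, S_{t+1})$, $A_t \sim \pi(a' \mid S_t)$, $\pi^* = \arg\max_{\pi} E_{\pi}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\right]$
  • 8. $\pi^* = \arg\max_{\pi} \underbrace{E_{\pi}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\right]}_{= J}$, $\nabla_\theta J$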
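  • 9. $\nabla_\theta J = E_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_t\right]$, $\nabla_\theta J = E_{s \sim \rho}\left[\nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\right]$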
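  • 10. $$\begin{aligned}
    \nabla_\theta J &= \nabla_\theta E_{\pi_\theta}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\right] \\
    &= \nabla_\theta E_{s_0 \sim \rho,\, s' \sim p}\left[\prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\right] \\
    &= E_{s_0 \sim \rho,\, s' \sim p}\left[\nabla_\theta \prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\right] \\
    &= E_{s \sim \rho}\left[\prod_{t=0} \pi_\theta(a_t \mid s_t)\, \frac{\nabla_\theta \prod_{t=0} \pi_\theta(a_t \mid s_t)}{\prod_{t=0} \pi_\theta(a_t \mid s_t)} \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\right] \\
    &= E_{s \sim \rho}\left[\prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau}\right] \\
    &= E_{\pi_\theta}\left[\sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\tau=t}^{\infty} \gamma^{\tau} r_{\tau}\right]
    \end{aligned}$$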
  • 11. $\left(\nabla \log p(x)\right) f(x)$
  • 12. $\left(\nabla \log p(x)\right) f(x)$
  • 13.
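  • 14. $J = E_{s \sim \rho}\left[Q^{\mu_\theta}(s, \mu_\theta(s))\right]$, $\nabla_\theta J = E_{s \sim \rho}\left[\nabla_\theta Q^{\mu}(s, \mu_\theta(s))\right] = E_{s \sim \rho}\left[\nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\right]$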
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
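  • 21. $\bar{f}(s_t, a_t) = f(s_t, \bar{a}_t) + \nabla_a f(s_t, a)\big|_{a=\bar{a}_t} (a_t - \bar{a}_t)$
    $$\begin{aligned}
    \nabla_\theta J &= E_{\rho,\pi}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(Q(s_t, a_t) - \bar{f}(s_t, a_t)\right)\right] + E_{\rho,\pi}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \bar{f}(s_t, a_t)\right] \\
    &= E_{\rho,\pi}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(Q(s_t, a_t) - \bar{f}(s_t, a_t)\right)\right] + E_{\rho,\pi}\left[\nabla_a f(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\right]
    \end{aligned}$$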
  • 22.
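  • 23. $\nabla_\theta J = E_{\rho,\pi}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(Q(s_t, a_t) - \bar{Q}_w(s_t, a_t)\right)\right] + E_{\rho,\pi}\left[\nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\right]$
    $\nabla_\theta J = E_{\rho,\pi}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(A(s_t, a_t) - \bar{A}_w(s_t, a_t)\right)\right] + E_{\rho,\pi}\left[\nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\right]$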
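  • 24. $\nabla_\theta J = E_{\rho,\pi}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(A(s_t, a_t) - \bar{A}_w(s_t, a_t)\right)\right] + E_{\rho,\pi}\left[\nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\right]$
    $$\begin{aligned}
    \bar{A}_w &= \bar{Q}_w(s_t, a_t) - E_{\pi}\left[\bar{Q}_w(s_t, a_t)\right] \\
    &= Q_w(s_t, \mu_\theta(s_t)) + \nabla_a Q_w(s_t, a)\big|_{a=\mu_\theta(s_t)} (a_t - \mu_\theta(s_t)) - E_{\pi}\left[Q_w(s_t, \mu_\theta(s_t)) + \nabla_a Q_w(s_t, a)\big|_{a=\mu_\theta(s_t)} (a_t - \mu_\theta(s_t))\right] \\
    &= \nabla_a Q_w(s_t, a)\big|_{a=\mu_\theta(s_t)} (a_t - \mu_\theta(s_t))
    \end{aligned}$$
    $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$, $E_{\pi}[a_t] = \mu_\theta(s_t)$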
  • 25.
  • 26. $m^* = m - \eta (t - \tau)$, $E[m^*] = E[m]$, $\mathrm{Var}[m^*] = \mathrm{Var}[m] - 2\eta\, \mathrm{Cov}[m, t] + \eta^2\, \mathrm{Var}[t]$, $\eta^* = \dfrac{\mathrm{Cov}[m, t]}{\mathrm{Var}[t]}$
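  • 27. $\nabla_\theta J = E_{\rho,\pi}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(A(s_t, a_t) - \eta(s_t)\, \bar{A}_w(s_t, a_t)\right)\right] + E_{\rho,\pi}\left[\eta(s_t)\, \nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}\, \nabla_\theta \mu_\theta(s_t)\right]$
    $\mathrm{Var}[A - \eta \bar{A}_w] = \mathrm{Var}[A] - 2\eta\, \mathrm{Cov}(A, \bar{A}_w) + \eta^2\, \mathrm{Var}(\bar{A}_w)$, $\eta^* = \dfrac{\mathrm{Cov}(A, \bar{A}_w)}{\mathrm{Var}(\bar{A}_w)}$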
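
The control-variate gradient on the last slides can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not in the slides: there is no state, the policy is a Gaussian N(theta, sigma^2) so that mu_theta = theta and its gradient with respect to theta is 1, the return is Q(a) = -(a - 2)^2, and the critic is a deliberately imperfect Q_w(a) = -(a - 1.5)^2. The function name `qprop_toy` and all constants are hypothetical, and eta is a single scalar estimated from the samples rather than a state-dependent eta(s_t).

```python
import numpy as np


def qprop_toy(theta=0.5, sigma=1.0, n=200_000, seed=0):
    """Compare the plain score-function gradient with a control-variate version.

    Hypothetical toy setting: pi_theta = N(theta, sigma^2), mu_theta = theta,
    Q(a) = -(a - 2)^2, imperfect critic Q_w(a) = -(a - 1.5)^2.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(theta, sigma, size=n)          # a_t ~ pi_theta

    q = -(a - 2.0) ** 2                           # sampled return Q(a_t)
    v = -((theta - 2.0) ** 2 + sigma**2)          # exact E_pi[Q], used as baseline
    adv = q - v                                   # advantage A(a_t)
    score = (a - theta) / sigma**2                # grad_theta log pi_theta(a_t)

    # First-order Taylor expansion of the critic around a = mu_theta:
    # A_bar_w = grad_a Q_w(a)|_{a=mu_theta} * (a_t - mu_theta)
    grad_qw = -2.0 * (theta - 1.5)
    adv_bar = grad_qw * (a - theta)

    # eta* = Cov(A, A_bar_w) / Var(A_bar_w), estimated from the same samples
    eta = np.cov(adv, adv_bar)[0, 1] / np.var(adv_bar, ddof=1)

    # Per-sample gradient estimates (grad_theta mu_theta = 1 in this toy)
    g_mc = score * adv
    g_cv = score * (adv - eta * adv_bar) + eta * grad_qw * 1.0

    return {
        "grad_true": -2.0 * (theta - 2.0),        # d/dtheta of E_pi[Q]
        "grad_mc": g_mc.mean(),
        "grad_qprop": g_cv.mean(),
        "var_mc": g_mc.var(ddof=1),
        "var_qprop": g_cv.var(ddof=1),
        "eta": eta,
    }


if __name__ == "__main__":
    for name, value in qprop_toy().items():
        print(f"{name}: {value:.4f}")
```

Both estimates should agree with the analytic gradient up to sampling noise, while the per-sample variance of the control-variate version should be clearly lower. Estimating eta from the same batch introduces a small bias that shrinks with the sample size; a separate batch could be used if that matters.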