http://wikibook.co.kr/reinforcement-learning/
https://github.com/wooridle/DeepRL-PPO-tutorial
http://web.stanford.edu/class/cs234/slides/lecture1_introduction.pdf
𝑠", 𝑎", 𝑟&, 𝑠&, 𝑎&, 𝑟', ⋯ , 𝑠)	
𝑃,,-
.
𝛾
𝜋 𝑎	 	𝑠)
𝑞(𝑠, 𝑎) = 𝑬 𝑅78& + 𝛾𝑅78' +	⋯	|𝑆7 = 𝑠, 𝐴7 = 𝑎
	𝑣 𝑠 		= 		𝑬 𝑅78& + 𝛾𝑅78' +	⋯	|𝑆7 = 𝑠
• $P_{ss'}^{a}$: probability of transitioning to the next state $s'$
• $\gamma$: discount factor
• $\pi(a \mid s)$: policy
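Both value functions above are expectations of the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots$. As a concrete illustration, here is a minimal sketch (the function name and sample rewards are my own, not from the slides) that computes $G_t$ backwards over one finished episode:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... for every step of a finished episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # G_t = r_{t+1} + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0]))  # [2.9602 1.98   2.    ]
```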
The policy is parameterized by $\theta$:
https://sites.google.com/view/deep-rl-bootcamp/lectures
$\pi_\theta(a \mid s) = \mathbb{P}[\,A_t = a \mid S_t = s, \theta\,]$
$J(\theta) = \mathbb{E}\!\left[\left.\sum_{t=1}^{T} r_t \,\right|\, \pi_\theta\right] = \mathbb{E}[\,r_1 + r_2 + r_3 + \cdots + r_T \mid \pi_\theta\,]$
The objective is to find the parameters $\theta$ that maximize $J(\theta)$.
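One common way to realize $\pi_\theta(a \mid s)$ is a small softmax network over discrete actions. The sketch below is an illustrative assumption; the layer sizes and the state/action dimensions are mine, not from the slides:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """pi_theta(a|s): a softmax distribution over actions, parameterized by theta."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

pi = Policy()
probs = pi(torch.zeros(4))            # pi_theta(.|s) for one state
action = torch.multinomial(probs, 1)  # sample a ~ pi_theta(.|s)
```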
https://www.slideshare.net/WoongwonLee/rlcode-a3c
Understanding A3C Easily and Deeply with RLCode
$\theta^{*} = \operatorname*{argmax}_{\theta} J(\theta) = \operatorname*{argmax}_{\theta} \mathbb{E}\!\left[\left.\sum_{t=0}^{T} r_t \,\right|\, \pi_\theta\right]$
$= \operatorname*{argmax}_{\theta} \sum_{t=0}^{T-1} P(s_t, a_t \mid \tau)\, r_{t+1}$
Update $\theta$ by gradient ascent:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
where the trajectory is $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \cdots, s_T)$.
http://rll.berkeley.edu/deeprlcourse/#lecture-videos
1. Sample a trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \cdots, s_T)$ from the current policy $\pi_\theta$.
2. Compute the gradient estimate: $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t=1} r(s, a)_t$
3. Update the parameters: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
4. Repeat from step 1.
http://rll.berkeley.edu/deeprlcourse/#lecture-videos
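Putting the four steps together, a minimal REINFORCE sketch might look like the following. The Gym-style `env.reset()`/`env.step()` interface and all names here are assumptions for illustration, not the tutorial's actual code:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(env):
    """Step 1: sample a trajectory tau = (s_0, a_0, r_1, ..., s_T)."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))         # log pi_theta(a_t|s_t)
        state, reward, done = env.step(action.item())   # assumed env API
        rewards.append(reward)
    return log_probs, rewards

def reinforce_update(log_probs, rewards, gamma=0.99):
    """Steps 2-3: grad J ~ sum_t grad log pi_theta(a_t|s_t) * G_t, then ascend."""
    returns, g = [], 0.0
    for r in reversed(rewards):          # G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
        returns.insert(0, g)
    loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # theta <- theta + alpha * grad J(theta)
    # Step 4: the caller repeats from step 1 with a fresh episode.
```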
1. The full return $\sum_{t=1} r(s, a)_t$, i.e. $G_t$, has high variance.
2. Subtract a learned baseline $V_v(s_t)$, and approximate the result with the one-step TD error:
$\nabla_\theta J(\theta) \sim \nabla_\theta \log \pi_\theta(a_t \mid s_t)\!\left(\sum_{t=0}^{T} r_t(s, a) - V_v(s_t)\right) \;\rightarrow\; r_{t+1} + \gamma V_v(s_{t+1}) - V_v(s_t)$
$\nabla_\theta J(\theta) \sim \nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(Q_w(s_t, a_t) - V_v(s_t)\right)$
REINFORCE (Monte Carlo return):
$\nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$
Actor-critic (TD error):
$\nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(r_{t+1} + \gamma V_v(s_{t+1}) - V_v(s_t)\right)$
Actor update term: $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(r_{t+1} + \gamma V_v(s_{t+1}) - V_v(s_t)\right)$
Critic loss: $\left(r_{t+1} + \gamma V_v(s_{t+1}) - V_v(s_t)\right)^2$
https://www.slideshare.net/WoongwonLee/rlcode-a3c
Understanding A3C Easily and Deeply with RLCode / Woongwon Lee
Critic (value) loss: $\left(r_{t+1} + \gamma V_v(s_{t+1}) - V_v(s_t)\right)^2$
Actor (policy) update: $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(r_{t+1} + \gamma V_v(s_{t+1}) - V_v(s_t)\right)$
The shared quantity is the TD error $\delta = r_{t+1} + \gamma V_v(s_{t+1}) - V_v(s_t)$.
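A minimal sketch of one actor-critic update built from these two terms; the network shapes and names are my own. Note that $\delta$ is detached where it weights the actor's log-probability, so only the critic is trained by the squared error:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # pi_theta(a|s)
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # V_v(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_update(state, action, reward, next_state, gamma=0.99):
    s = torch.as_tensor(state, dtype=torch.float32)
    s2 = torch.as_tensor(next_state, dtype=torch.float32)
    # TD error: delta = r_{t+1} + gamma * V_v(s_{t+1}) - V_v(s_t)
    td_target = reward + gamma * critic(s2).detach()
    delta = td_target - critic(s)
    critic_loss = delta.pow(2).mean()     # the squared TD error from the slides
    # Actor: ascend grad log pi_theta(a_t|s_t) * delta (delta detached as advantage)
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -(dist.log_prob(torch.as_tensor(action)) * delta.detach()).mean()
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```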
$\text{maximize}\quad \nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t$
1. Define the probability ratio between the updated and the previous policy:
$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} = \dfrac{\text{new policy}}{\text{old policy}}$
2. Clip $r_t(\theta)$ to the interval $[1 - \varepsilon,\ 1 + \varepsilon]$, so that a single update cannot move the policy too far from the old one:
$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta) A_t,\ \operatorname{clip}(r_t(\theta),\ 1 - \varepsilon,\ 1 + \varepsilon)\, A_t\right)\right]$
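A sketch of the resulting clipped loss, assuming the standard surrogate from the PPO paper; the tensors of log-probabilities and advantages are assumed to come from a rollout buffer:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), via log-probs
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # L^CLIP takes the min, so the update never profits from leaving [1-eps, 1+eps]
    return -torch.min(unclipped, clipped).mean()

# e.g. a positive advantage with the ratio already above 1 + eps gets clipped:
loss = ppo_clip_loss(torch.tensor([0.5]), torch.tensor([0.0]), torch.tensor([2.0]))
```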