PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
TENSORFLOW + KERAS & OPENAI GYM
1
CONTENTS
Playing Atari with Deep Reinforcement Learning
 Playing Atari with Deep Reinforcement Learning
 Human-Level Control through Deep Reinforcement Learning
 Deep Reinforcement Learning with Q-Learning
2
PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
3
ATARI 2600
http://atariage.com/index.php
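The Atari 2600 is a classic game console released in 1977
 One of the first home video game consoles
 Supports a 160 x 192 screen resolution with up to 128 colors; the console has 128 bytes of RAM and 6 KB of ROM
 The Nintendo Famicom (FC) did not appear until 1983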
4
PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
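DeepMind: the objective is to find an optimal policy
 Showed how a computer can learn to play Atari 2600 games
 The striking part of the result is that the computer only observes the screen pixels and receives a reward when the game score increases
 The same model architecture
 learned seven different games
 and played three of them better than a human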
5
HUMAN LEVEL
Original Results on Atari Games Beating Human Level
6
A3C (ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC)
RESULTS ON ATARI GAMES
7
PLAYING ATARI
8
PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
Reinforcement Learning: the objective is to find an optimal policy
1. Observe the current state
2. Take an action based on that state
3. Receive the resulting reward
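The three steps above can be sketched as a loop. This is only an illustration: CoinFlipEnv is a hypothetical toy environment invented here (it is not part of Gym or of the original slides), and run_episode is an assumed helper name.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the state is a hidden bit, and the
    agent is rewarded for choosing the action equal to that bit."""

    def __init__(self, seed=0, horizon=5):
        self.rng = random.Random(seed)
        self.horizon = horizon
        self.t = 0
        self.state = 0

    def reset(self):
        self.t = 0
        self.state = self.rng.randint(0, 1)
        return self.state

    def step(self, action):
        # Reward is judged against the state the agent just observed.
        reward = 1.0 if action == self.state else 0.0
        self.t += 1
        self.state = self.rng.randint(0, 1)
        done = self.t >= self.horizon
        return self.state, reward, done


def run_episode(env, policy):
    state = env.reset()                          # 1. observe the current state
    total, done = 0.0, False
    while not done:
        action = policy(state)                   # 2. act based on that state
        state, reward, done = env.step(action)   # 3. receive the reward
        total += reward
    return total
```

A policy here is just a function from state to action, for example run_episode(CoinFlipEnv(), lambda s: s).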
9
BREAKOUT
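Tested on Ubuntu 16.04: Breakout
 State
 The position of the ball on the screen
 Action
 Train the computer to play the game
 Input: a screenshot of the game screen
 Output: move the paddle left or right, or launch the ball
 Reward
 The upper part of the screen is filled with bricks; when the ball hits a brick it breaks the brick and you score points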
10
RESOURCES
Playing Atari with Deep Reinforcement Learning
 https://courses.cs.ut.ee/MTAT.03.291/2014_spring/uploads/Main/Replicating%20DeepMind.pdf
Replicating-DeepMind
 https://github.com/kristjankorjus/Replicating-DeepMind
11
RESOURCES
DeepMind Atari Deep Q Learner
 https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner
 LuaJIT and Torch 7.0
 nngraph
 Xitari (fork of the Arcade Learning Environment (Bellemare et al., 2013))
 AleWrap (a Lua interface to Xitari)
 An install script for these dependencies is provided.
Asynchronous RL in TensorFlow + Keras with OpenAI's Gym
 https://github.com/coreylynch/async-rl
 tensorflow
 gym
 gym's atari environment (https://github.com/openai/gym#atari)
 skimage
 Keras
12
RESOURCES
The Arcade Learning Environment
 http://www.arcadelearningenvironment.org/
ALE (Visual Studio Version)
 https://github.com/mvacha/A.L.E.-0.4.4.-Visual-Studio
13
APT-GET INSTALL
 libtiff5-dev
 libjpeg8-dev
 zlib1g-dev
 liblcms2-dev
 libwebp-dev
 tcl8.6-dev
 tk8.5-dev
 python-tk
 cmake
 xvfb
14
DEEP NEURAL NETWORKS
 TensorFlow is a good, flexible deep learning framework
 Backpropagation and deep neural networks do much of the work; the reinforcement learning challenge is finding the best loss function to train on
15
HOW TO RUN AI AGENTS ON GAMES?
https://gym.openai.com/ OpenAI Gym
 Library of Environments
 Pong
 Breakout
 Cart-Pole
 Same API
 Provides way to share and compare results
16
HOW TO RUN AI AGENTS ON GAMES?
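https://gym.openai.com/
pip install -e '.[atari]'
import gym

env = gym.make('SpaceInvaders-v0')
obs = env.reset()                    # initial observation (a screen frame)
env.render()                         # display the current frame
action = env.action_space.sample()   # random action as a placeholder policy
obs, reward, done, _ = env.step(action)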
17
OTHER OPTIONS
https://github.com/DanielSlater/PyGamePlayer
PyGame
 Thousands of games
 Easy to change game code
 PyGamePlayer
 Half pong
18
python async_dqn.py --experiment breakout --game "Breakout-v0" --num_concurrent 8
Checkpoints
/tmp/checkpoints/
TensorBoard Summary
tensorboard --logdir /tmp/summaries/breakout
"created":1485854183,
"episode_types":["t"],
"episode_lengths":[1717],
"object":"episode_batch",
"initial_reset_timestamps":[1485853848.3293480873],
"episode_rewards":[62.0],
"data_sources":[0],
"seeds":[],
"main_seeds":[],
"timestamps":[1485853853.9296009541],
"env_id":"Breakout-v0",
"initial_reset_timestamp":1485853848.3293480873,
"id":"eb_taFBJqLFThuZ5jBwO0NFTQ"
19
ALE GRAYSCALE CONVERSION METHOD
RGB images grayscale conversion
20
SCREENSHOT
frame skipping maximum over two consecutive frames
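A minimal sketch of the two preprocessing steps named on these slides, grayscale conversion and taking the maximum over two consecutive frames. The luminance weights 0.299/0.587/0.114 are the common BT.601 values and are an assumption here; ALE may compute luminance differently. The function names are illustrative only.

```python
def to_grayscale(frame):
    """Convert rows of (r, g, b) tuples into rows of luminance values.
    The weights are the usual BT.601 ones (an assumption, see above)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def max_over_frames(frame_a, frame_b):
    """Element-wise maximum of two equally sized grayscale frames.
    Used with frame skipping so that objects drawn only on alternating
    frames still show up in the observation."""
    return [[max(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]
```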
21
100-EPISODE (2 HOURS) AVERAGE REWARD WAS 68.97
Training episode batch video (mp4) Visualizing training with tensorboard
22
VISUALIZING TRAINING WITH TENSORBOARD
Episode Reward Max Q Value
23
REINFORCEMENT LEARNING
24
MARKOV DECISION PROCESS
 The policy that selects these actions
 In general, the environment is stochastic
 so the next state that appears is also random
 MDP < S, A, P, R, 𝛾 >
 S: set of states
 A: set of actions
 P(s, a, s’): probability of the transition from s to s’ under action a
 R(s): reward function
 𝛾: discount factor
 Trace: {<s0,a0,r0>, …, <sn,an,rn>}
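As a small worked example of how the discount factor applies to a trace, the sketch below sums the rewards of a three-step trace with each reward weighted by a power of gamma. The trace values and the name discounted_return are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r0 + gamma*r1 + gamma^2*r2 + ... computed back to front."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A hypothetical trace of <state, action, reward> triples.
trace = [("s0", "a0", 1.0), ("s1", "a1", 0.0), ("s2", "a2", 2.0)]
rewards = [r for (_, _, r) in trace]
```

With gamma = 0.5 this trace gives 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5.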
25
Convolutional networks Network architecture
26
REINFORCEMENT LEARNING
3 categories of reinforcement learning
 Value learning: Q-learning
 Given a state and a set of possible actions, choose the action with the best reward
 Policy learning: policy gradients
 Use gradients to find the best policy
 Model learning
 Learn the transitions between different states
 Min-Max
 Monte-Carlo sampling
Definitions
 Return: total discounted reward: R = ∑_{t≥0} 𝛾^t r_t
 Policy: the agent’s behavior
 Deterministic policy: π(s) = a
 Stochastic policy: π(a | s) = P[A_t = a | S_t = s]
 Value function: expected return starting from state s:
 State-value function: V^π(s) = E_π[R | S_t = s]
 Action-value function: Q^π(s, a) = E_π[R | S_t = s, A_t = a]
27
LEARNING
Deep Q Learning
 Model-free, off-policy technique to learn the optimal Q(s, a):
 Q_{i+1}(s, a) ← Q_i(s, a) + 𝛼(r + 𝛾 max_{a’} Q_i(s’, a’) - Q_i(s, a))
 The optimal policy is then π(s) = argmax_{a’} Q(s, a’)
 Requires exploration (ε-greedy) to try various transitions from each state.
 Take a random action with probability ε; start with ε high and decay it to a low value as training progresses.
 Deep Q Learning: approximate Q(s, a) with a neural network: Q(s, a, 𝜃)
 Do stochastic gradient descent using the loss
 L(𝜃_i) = MSE_{s, a}(Q(s, a, 𝜃_i), r + 𝛾 max_{a’} Q(s’, a’, 𝜃_{i-1}))
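The tabular update rule and the ε-greedy choice above can be sketched in a few lines. This is a table-based illustration under assumed names (q_update, epsilon_greedy), not the neural-network version; the Q table is a plain dictionary keyed by (state, action).

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, s, a, r, s_next, actions, alpha, gamma, done=False):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    On a terminal transition the target is just r."""
    target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]
```

For example, with Q(s1, right) = 2.0, alpha = 0.5 and gamma = 0.9, a reward of 1.0 moves Q(s0, left) from 0 to 0.5 * (1.0 + 0.9 * 2.0) = 1.4.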
Policy Gradient
 Given a policy π_𝜃(a | s), find the 𝜃 that maximizes the expected return:
 J(𝜃) = ∑_s d^π(s) V^π(s)
 In Deep RL, we approximate π_𝜃(a | s) with a neural network,
 usually with a softmax layer on top to estimate the probability of each action.
 We can estimate J(𝜃) from samples of observed behavior: ∑_{k=0..T} p_𝜃(𝜏_k | π) R(𝜏_k)
 Do stochastic gradient ascent using the update:
 𝜃_{i+1} = 𝜃_i + 𝛼 (1/T) ∑_{k=0..T} ∇ log p_𝜃(𝜏_k | π) R(𝜏_k)
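To make the update concrete, here is a sketch of a single policy-gradient step for the simplest possible case: a softmax policy with one logit per action and no state. That simplification, and the names softmax and reinforce_step, are assumptions for illustration. For a softmax policy the gradient of log π(action) with respect to logit j is 1[j == action] - π_j.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(theta, action, ret, alpha):
    """One gradient-ascent step on log pi(action) weighted by the return."""
    probs = softmax(theta)
    return [t + alpha * ret * ((1.0 if j == action else 0.0) - p)
            for j, (t, p) in enumerate(zip(theta, probs))]
```

Starting from theta = [0, 0], an update for action 1 with return 1.0 and alpha = 0.1 gives [-0.05, 0.05], so the sampled action becomes more likely.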
28
DQN OPTIMIZATION
29
ASYNC ADVANTAGE ACTOR-CRITIC (A3C)
 Asynchronous: uses multiple instances of the environment and the network in parallel
 Actor-Critic: uses both a policy and an estimate of the value function
 Advantage: estimates how much better or worse the outcome was than expected
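One common way to compute the advantage is the discounted return observed from each step, bootstrapped from a value estimate at the end of the rollout, minus the critic's value estimate for that step. The sketch below assumes that formulation; the function names are illustrative and not taken from any particular A3C implementation.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted return from each step, bootstrapped from the value
    estimate of the state that follows the last reward."""
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, bootstrap_value, gamma):
    """Advantage = observed return minus the critic's estimate V(s_t)."""
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    return [ret - v for ret, v in zip(returns, values)]
```

A positive advantage means the outcome was better than the critic expected, so the actor should make that action more likely.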
30
TENSORFLOW-RL/EXAMPLES/ATARI-RL.PY
31
ACTING
Environment Random Agent
32
Q-NETWORK
Q-network Optimization
33
Q-NETWORK
Q-network Layer
 Convolutional Layer
 16 filters of size 8 x 8 with stride 4 x 4, followed by a ReLU nonlinearity
 32 filters of size 4 x 4 with stride 2 x 2, followed by a ReLU nonlinearity
 Flatten
 Unrolls the response into a one-dimensional vector
 Fully-Connected Layer
 256 units, followed by a ReLU nonlinearity
 num_actions units with a linear activation, one score (called the Q value) per action
 Pooling Layer
 none
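The layer list above fixes the tensor shapes once an input size is chosen. The sketch below works them out, assuming unpadded ("valid") convolutions; the 84 x 84 input used in the example is an assumption taken from common DQN setups, since the slides do not state the input size.

```python
def conv_output_size(size, kernel, stride):
    """Output length along one axis of a convolution without padding."""
    return (size - kernel) // stride + 1


def q_network_shapes(height, width, num_actions):
    """Output shape of each layer in the Q-network described above."""
    h1, w1 = conv_output_size(height, 8, 4), conv_output_size(width, 8, 4)
    h2, w2 = conv_output_size(h1, 4, 2), conv_output_size(w1, 4, 2)
    return [
        ("conv1", (h1, w1, 16)),        # 16 filters, 8 x 8, stride 4
        ("conv2", (h2, w2, 32)),        # 32 filters, 4 x 4, stride 2
        ("flatten", (h2 * w2 * 32,)),   # unrolled into one vector
        ("dense", (256,)),              # fully connected, ReLU
        ("q_values", (num_actions,)),   # linear, one Q value per action
    ]
```

For an 84 x 84 input the first convolution yields 20 x 20 x 16, the second 9 x 9 x 32, and the flattened vector has 2592 entries.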
34
Q-NETWORK
Q-network Monitored Training Session
35
POLICY NETWORK
Policy Network Optimization
36
POLICY AND VALUE AND POLICY NETWORKS
Networks optimization
37
PROBLEM
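temporal credit assignment
 Distributing credit over time
 Earlier actions affect the reward obtained now
 The influence of the order of actions
 experience replay
 All experiences <S, A, R, S’> are stored in a table
balancing exploration and exploitation
 Balancing how to act
 Exploit the existing policy
 or explore other, possibly better policies
 ε-greedy exploration
 Act greedily according to the highest Q value
 With probability ε, pick a random action instead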
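A minimal sketch of the two mechanisms on this slide: a fixed-size table of experiences that is sampled at random, and an exploration rate that is annealed from high to low. The class name, the capacity, and the linear schedule with its default values are all assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of <s, a, r, s'> experiences; oldest entries drop out."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the correlation
        # between consecutive experiences.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def annealed_epsilon(step, start=1.0, end=0.1, decay_steps=1000):
    """Linearly decay epsilon from start to end over decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

Training would then draw minibatches with sample() instead of learning only from the most recent transition.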
38
THANK YOU!
39

TensorFlow + Keras & OpenAI Gym

  • 1. PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING: TENSORFLOW + KERAS & OPENAI GYM
  • 2. CONTENTS (Playing Atari, Deep Reinforcement Learning)
      - Playing Atari with Deep Reinforcement Learning
      - Human-Level Control through Deep Reinforcement Learning
      - Deep Reinforcement Learning with Q-Learning
  • 3. PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
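  • 4. ATARI 2600 (http://atariage.com/index.php)
      - The Atari 2600 is a classic game console released in 1977
      - One of the first home video game consoles
      - Supports a 160 x 192 resolution display with up to 128 colors; the console has 128 bytes of RAM and 6 KB of ROM
      - Nintendo's Famicom (FC) did not appear until 1983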
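  • 5. PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING (DeepMind): the objective is to find an optimal policy
      - Showed how a computer can learn to play Atari 2600 games
      - The striking part of the result is that the computer observes only the screen pixels and receives a reward when the game score increases
      - The same model architecture was used throughout
      - It learned seven different games
      - It played three of them better than a human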
  • 6. HUMAN LEVEL: original results on Atari games, beating human level
  • 7. A3C (ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC): results on Atari games
  • 9. PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING (Reinforcement Learning): the objective is to find an optimal policy
      1. Given the current state
      2. Take an action based on that state
      3. Get the current reward
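  • 10. BREAKOUT (tested on Ubuntu 16.04)
      - State: the position of the ball on the screen
      - Action: train the computer to play the game
          - Input: screenshots of the game
          - Output: control the paddle (left, right, launch the ball)
      - Reward: the upper part of the screen is filled with bricks; when the ball hits a brick it breaks the brick and you score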
  • 11. RESOURCES
      - Playing Atari with Deep Reinforcement Learning: https://courses.cs.ut.ee/MTAT.03.291/2014_spring/uploads/Main/Replicating%20DeepMind.pdf
      - Replicating-DeepMind: https://github.com/kristjankorjus/Replicating-DeepMind
  • 12. RESOURCES
      - DeepMind Atari Deep Q Learner: https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner
          - LuaJIT and Torch 7.0
          - nngraph
          - Xitari (a fork of the Arcade Learning Environment (Bellemare et al., 2013))
          - AleWrap (a Lua interface to Xitari)
          - An install script for these dependencies is provided.
      - Asynchronous RL in Tensorflow + Keras & OpenAI's Gym: https://github.com/coreylynch/async-rl
          - tensorflow
          - gym
          - gym's Atari environment (https://github.com/openai/gym#atari)
          - skimage
          - Keras
  • 13. RESOURCES
      - The Arcade Learning Environment: http://www.arcadelearningenvironment.org/
      - ALE (Visual Studio version): https://github.com/mvacha/A.L.E.-0.4.4.-Visual-Studio
  • 14. APT-GET INSTALL: libtiff5-dev, libjpeg8-dev, zlib1g-dev, liblcms2-dev, libwebp-dev, tcl8.6-dev, tk8.5-dev, python-tk, cmake, xvfb
  • 15. DEEP NEURAL NETWORKS
      - TensorFlow is a good, flexible deep learning framework
      - Backpropagation and deep neural networks do much of the work; the reinforcement learning challenge is finding the best loss function to train on
  • 16. HOW TO RUN AI AGENTS ON GAMES? (https://gym.openai.com/)
      - OpenAI Gym is a library of environments: Pong, Breakout, Cart-Pole
      - All environments share the same API
      - Provides a way to share and compare results
  • 17. HOW TO RUN AI AGENTS ON GAMES? (https://gym.openai.com/)
      pip install -e '.[atari]'
      import gym
      env = gym.make('SpaceInvaders-v0')
      obs = env.reset()
      env.render()
      ob, reward, done, _ = env.step(action)
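The calls on this slide make up the usual agent loop: reset once, then step until the episode is done. The sketch below shows that loop under stated assumptions. To stay runnable without the Atari dependencies, a hypothetical `ToyEnv` class stands in for `gym.make('SpaceInvaders-v0')`; it copies only the `reset()` and `step(action)` call shapes shown on the slide (the four-value return used by Gym releases of that era), and the random policy is likewise an illustration.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (not part of gym)."""

    n_actions = 3

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is just the time step

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done, {}


def run_episode(env, policy):
    """Play one episode and return the total (undiscounted) reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        total += reward
    return total


random.seed(0)
random_policy = lambda obs: random.randrange(ToyEnv.n_actions)
episode_reward = run_episode(ToyEnv(), random_policy)
```

Swapping `ToyEnv()` for a real Gym environment should leave `run_episode` unchanged as long as the environment returns the same four values from `step`.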
  • 18. OTHER OPTIONS (https://github.com/DanielSlater/PyGamePlayer)
      - PyGame: thousands of games; easy to change the game code
      - PyGamePlayer
      - Half Pong
  • 19. python async_dqn.py --experiment breakout --game "Breakout-v0" --num_concurrent 8
      - Checkpoints: /tmp/checkpoints/
      - TensorBoard summary: tensorboard --logdir /tmp/summaries/breakout
      - Episode batch: "created":1485854183, "episode_types":["t"], "episode_lengths":[1717], "object":"episode_batch", "initial_reset_timestamps":[1485853848.3293480873], "episode_rewards":[62.0], "data_sources":[0], "seeds":[], "main_seeds":[], "timestamps":[1485853853.9296009541], "env_id":"Breakout-v0", "initial_reset_timestamp":1485853848.3293480873, "id":"eb_taFBJqLFThuZ5jBwO0NFTQ"
  • 20. ALE GRAYSCALE CONVERSION METHOD: RGB images are converted to grayscale
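The slide names the RGB-to-grayscale step but not the formula. A minimal sketch follows, assuming the common ITU-R BT.601 luma weights; ALE's own conversion may differ, and the function name `to_grayscale` is hypothetical.

```python
import numpy as np

# Assumed luma weights (ITU-R BT.601); the slide does not say which weights ALE uses.
LUMA = np.array([0.299, 0.587, 0.114])


def to_grayscale(frame_rgb):
    """Convert an (H, W, 3) uint8 RGB frame to an (H, W) uint8 grayscale frame."""
    gray = frame_rgb.astype(np.float64) @ LUMA  # weighted sum over the color axis
    return np.rint(gray).astype(np.uint8)


frame = np.zeros((210, 160, 3), dtype=np.uint8)  # Gym's Atari frames are 210 x 160 RGB
frame[..., 0] = 255                              # a pure red test image
gray = to_grayscale(frame)
```

Dropping the color axis cuts the input to a third of its size before it reaches the network.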
  • 21. SCREENSHOT: frame skipping; take the maximum over two consecutive frames
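Frame skipping repeats one action for several frames, and the pixel-wise maximum over the last two frames hides sprites that flicker on alternate frames. The sketch below combines the two. The skip length of 4 and the `step_fn(action) -> (frame, reward, done)` callback are assumptions for illustration, not something the slide specifies.

```python
import numpy as np


def max_over_last_two(frames):
    """Pixel-wise maximum over the last two frames of a skipped step."""
    return np.maximum(frames[-2], frames[-1])


def skip_step(step_fn, action, skip=4):
    """Repeat `action` for up to `skip` frames.

    Returns (max-pooled frame, summed reward, done).
    """
    frames, total, done = [], 0.0, False
    for _ in range(skip):
        frame, reward, done = step_fn(action)
        frames.append(frame)
        total += reward
        if done:
            break
    if len(frames) == 1:  # the episode ended on the first frame
        return frames[0], total, done
    return max_over_last_two(frames), total, done
```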
  • 22. 100-EPISODE (2 HOURS) AVERAGE REWARD WAS 68.97: training episode batch video (mp4); visualizing training with TensorBoard
  • 23. VISUALIZING TRAINING WITH TENSORBOARD: episode reward; max Q value
  • 25. MARKOV DECISION PROCESS
      - A policy chooses which actions to take
      - In general the environment is stochastic
      - The next state that appears is therefore also random
      - MDP <S, A, P, R, γ>
          - S: set of states
          - A: set of actions
          - P(s, a, s'): probability of transition
          - R(s): reward function
          - γ: discount factor
      - Trace: {<s0, a0, r0>, ..., <sn, an, rn>}
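To make the tuple concrete, here is a small sketch that samples a trace from a made-up MDP. The two states, the two actions, and every probability and reward below are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical two-state MDP, written as P[s][a] = [(next_state, probability), ...]
P = {
    "s0": {"stay": [("s0", 0.9), ("s1", 0.1)], "go": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {"s0": 0.0, "s1": 1.0}  # reward function R(s)


def sample_next(state, action, rng):
    """Draw the next state s' according to P(s, a, s')."""
    next_states, probs = zip(*P[state][action])
    return rng.choices(next_states, weights=probs, k=1)[0]


def sample_trace(policy, start, steps, rng):
    """Roll out a trace [(s0, a0, r0), ..., (sn, an, rn)]."""
    trace, state = [], start
    for _ in range(steps):
        action = policy(state)
        trace.append((state, action, R[state]))
        state = sample_next(state, action, rng)
    return trace


trace = sample_trace(lambda s: "go", "s0", steps=5, rng=random.Random(0))
```

Because the transitions are sampled, two rollouts of the same policy can produce different traces, which is the randomness the slide refers to.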
  • 27. REINFORCEMENT LEARNING
      - Three categories of reinforcement learning
          - Value learning (Q-learning): given a state and a set of possible actions, choose the action with the best reward
          - Policy learning (policy gradients): use gradients to find the best policy
          - Model learning: learn the transitions between states (Min-Max, Monte-Carlo sampling)
      - Definitions
          - Return: total discounted reward, $R = \sum_{k \ge 0} \gamma^{k} r_{t+k}$
          - Policy: the agent's behavior
              - Deterministic policy: $\pi(s) = a$
              - Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
          - Value function: expected return starting from state s
              - State-value function: $V^{\pi}(s) = E_{\pi}[R \mid S_t = s]$
              - Action-value function: $Q^{\pi}(s, a) = E_{\pi}[R \mid S_t = s, A_t = a]$
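The return defined on this slide can be computed from a list of rewards by accumulating from the last step backwards. The sketch below assumes the standard discounted sum with discount factor gamma; the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def returns_per_step(rewards, gamma):
    """Discounted return from every time step to the end of the episode."""
    out, total = [], 0.0
    for r in reversed(rewards):
        total = r + gamma * total
        out.append(total)
    return out[::-1]
```

Averaging `returns_per_step` over many episodes that pass through the same state gives a Monte-Carlo estimate of the state-value function for that state.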
  • 28. LEARNING
      - Deep Q Learning
          - Model-free, off-policy technique to learn the optimal $Q(s, a)$:
              - $Q_{i+1}(s, a) \leftarrow Q_i(s, a) + \alpha \left( r + \gamma \max_{a'} Q_i(s', a') - Q_i(s, a) \right)$
          - The optimal policy is then $\pi(s) = \arg\max_{a'} Q(s, a')$
          - Requires exploration (ε-greedy) to try the various transitions out of each state
              - Take a random action with probability ε; start with ε high and decay it to a low value as training progresses
          - Deep Q Learning approximates $Q(s, a)$ with a neural network $Q(s, a, \theta)$
              - Do stochastic gradient descent using the loss
              - $L(\theta) = \mathrm{MSE}_{s, a}\left( Q(s, a, \theta_i),\; r + \gamma \max_{a'} Q(s', a', \theta_{i-1}) \right)$
      - Policy Gradient
          - Given a policy $\pi_\theta(a \mid s)$, find the $\theta$ that maximizes the expected return:
              - $J(\theta) = \sum_{s} d^{\pi}(s) V(s)$
          - In deep RL, we approximate $\pi_\theta(a \mid s)$ with a neural network, usually with a softmax layer on top to estimate the probability of each action
          - We can estimate $J(\theta)$ from samples of observed behavior: $\sum_{k=0}^{T} p_\theta(\tau_k \mid \pi) R(\tau_k)$
          - Do stochastic gradient ascent using the update:
              - $\theta_{i+1} = \theta_i + \alpha \frac{1}{T} \sum_{k=0}^{T} \nabla \log p_\theta(\tau_k \mid \pi) R(\tau_k)$
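The Q-learning update and the ε-greedy rule on this slide can be tried out in tabular form before any neural network is involved. The sketch below does that on a hypothetical four-state chain; the environment, the learning rate, the discount factor, and the decay schedule for ε are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])


def q_update(Q, s, a, r, s_next, done, n_actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def chain_step(s, a):
    """Hypothetical chain 0..3: action 1 moves right, action 0 moves left.

    Reaching state 3 gives reward 1 and ends the episode.
    """
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done


rng = random.Random(0)
Q = defaultdict(float)
epsilon = 1.0
for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(Q, s, 2, epsilon, rng)
        s_next, r, done = chain_step(s, a)
        q_update(Q, s, a, r, s_next, done, 2)
        s = s_next
    epsilon = max(0.05, epsilon * 0.99)  # start high, decay toward a floor

greedy_policy = [max(range(2), key=lambda a: Q[(s, a)]) for s in range(3)]
```

Deep Q-learning keeps the same target, `r + gamma * max_a' Q(s', a')`, but replaces the table with a network and fits it by gradient descent on the squared error.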
  • 30. ASYNC ADVANTAGE ACTOR-CRITIC (A3C)
      - Asynchronous: uses multiple instances of the environment and the network
      - Actor-Critic: uses both a policy and an estimate of the value function
      - Advantage: estimates how much better or worse the outcome was than expected
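The slide gives no formula for the advantage. One common form, assumed here, subtracts the critic's value estimate from a bootstrapped n-step return. In the sketch below, `values` holds the critic's estimates for the visited states and `bootstrap_value` is its estimate for the state after the last step (0 if the episode ended); all names are hypothetical.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """A_t = (r_t + gamma * r_{t+1} + ... + gamma^(n-t) * V(s_n)) - V(s_t)."""
    advantages, ret = [], bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret        # discounted return from this step onward
        advantages.append(ret - v)   # how much better than the critic expected
    return advantages[::-1]
```

A positive advantage means the action turned out better than the critic predicted, so the actor's probability of that action is pushed up; a negative one pushes it down.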
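  • 34. Q-NETWORK (Q-network layers)
      - Convolutional layers
          - 16 filters of 8 x 8 with stride 4 x 4, followed by a ReLU nonlinearity
          - 32 filters of 4 x 4 with stride 2 x 2, followed by a ReLU nonlinearity
      - Flatten: unrolls the response into a one-dimensional vector
      - Fully connected layers
          - 256 units, followed by a ReLU nonlinearity
          - num_actions units with a linear activation, giving a score for each action (the Q value)
      - Pooling layer: none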
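The layer list above fixes the size of every intermediate tensor once the input size is known. The sketch below works through that arithmetic. The 84 x 84 x 4 input (four stacked grayscale frames), the unpadded convolutions, and the default of three actions are assumptions, since the slide does not state them.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def q_network_shapes(height=84, width=84, channels=4, num_actions=3):
    """List (layer name, output shape) for the layers named on the slide."""
    shapes = [("input", (height, width, channels))]
    h, w = conv_out(height, 8, 4), conv_out(width, 8, 4)
    shapes.append(("conv 16 @ 8x8, stride 4", (h, w, 16)))
    h, w = conv_out(h, 4, 2), conv_out(w, 4, 2)
    shapes.append(("conv 32 @ 4x4, stride 2", (h, w, 32)))
    shapes.append(("flatten", (h * w * 32,)))
    shapes.append(("dense 256", (256,)))
    shapes.append(("dense num_actions", (num_actions,)))
    return shapes


for name, shape in q_network_shapes():
    print(name, shape)
```

Under these assumptions the flattened vector has 9 * 9 * 32 = 2592 entries, so the first fully connected layer holds most of the network's weights.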
  • 36. POLICY NETWORK: policy network optimization
  • 37. POLICY AND VALUE NETWORKS: network optimization
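  • 38. PROBLEM
      - Temporal credit assignment
          - How credit is distributed over time
          - Earlier actions affect the reward obtained now
          - The order of actions determines their influence
      - Experience replay
          - Every experience <S, A, R, S'> is stored in a table
      - Balancing exploration and exploitation
          - Balancing which actions to take
          - Exploit the existing policy
          - Or explore other policies that may be better
      - ε-greedy exploration
          - Act greedily by taking the action with the highest Q value
          - With some probability, choose a random action instead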
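Experience replay and the exploration schedule can both be sketched in a few lines. The `ReplayBuffer` class and the linear `epsilon_at` schedule below are hypothetical illustrations; the capacity, the decay length, and the start and end values of ε are assumptions, and the stored tuple adds a `done` flag to the four fields named on the slide.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size table of past experiences; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random minibatch without replacement."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def epsilon_at(step, start=1.0, end=0.1, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(t, 0, 0.0, t + 1, False)
batch = buffer.sample(2)
```

Sampling minibatches at random from the table breaks the correlation between consecutive frames, which is the main reason replay is used for training.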