TensorFlow + Keras & OpenAI Gym

PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
TENSORFLOW + KERAS & OPENAI GYM

CONTENTS
 Playing Atari with Deep Reinforcement Learning
 Human-Level Control through Deep Reinforcement Learning
 Deep Reinforcement Learning with Q-Learning

PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING

ATARI 2600
http://atariage.com/index.php
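The Atari 2600 is a classic game console released in 1977
 One of the earliest home video game consoles
 Supports a 160 x 192 screen resolution with up to 128 colors; the console has 128 bytes of RAM and 6 KB of ROM
 The Nintendo Famicom (FC) did not appear until several years later, in 1983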

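DeepMind: the objective is to find an optimal policy
 Showed how a computer can learn to play Atari 2600 games
 What makes the result striking is that the computer only observes the screen pixels and receives a reward when the game score increases
 The same model architecture
 Learned seven different games
 Played three of them better than humans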

HUMAN LEVEL
Original Results on Atari Games Beating Human Level

A3C (ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC)
RESULTS ON ATARI GAMES

Reinforcement Learning: the objective is to find an optimal policy
1. Observe the current state
2. Take an action based on that state
3. Receive the reward for that action
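The three steps above can be sketched as a loop between an agent and an environment. This is a minimal illustration, not code from the talk: `CoinFlipEnv`, `run_episode`, and `copy_state_policy` are hypothetical names, and the environment is a toy with a Gym-like `reset`/`step` interface.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment with a Gym-like reset/step interface."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.state = 0

    def reset(self):
        self.steps = 0
        self.state = self.rng.randint(0, 1)
        return self.state

    def step(self, action):
        # Reward 1 when the action matches the current state, else 0.
        reward = 1.0 if action == self.state else 0.0
        self.steps += 1
        self.state = self.rng.randint(0, 1)
        done = self.steps >= self.episode_length
        return self.state, reward, done


def run_episode(env, policy):
    state = env.reset()                          # 1. observe the current state
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                   # 2. take an action based on the state
        state, reward, done = env.step(action)   # 3. receive the reward
        total_reward += reward
    return total_reward


def copy_state_policy(state):
    """A policy that happens to be optimal for this toy environment."""
    return state


episode_return = run_episode(CoinFlipEnv(), copy_state_policy)
print(episode_return)
```

A learning agent would replace `copy_state_policy` with a policy that it improves from the rewards it collects.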

BREAKOUT
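Breakout (tested on Ubuntu 16.04)
 State
 Position of the ball on the screen
 Action
 Train the computer to play the game
 Input: screenshot of the game screen
 Output: move the paddle left or right, or launch the ball
 Reward
 The upper half of the screen is filled with bricks; when the ball hits a brick it breaks the brick and you score points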

RESOURCES
Playing Atari with Deep Reinforcement Learning
 https://courses.cs.ut.ee/MTAT.03.291/2014_spring/uploads/Main/Replicating%20DeepMind.pdf
Replicating-DeepMind
 https://github.com/kristjankorjus/Replicating-DeepMind

RESOURCES
DeepMind Atari Deep Q Learner
 https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner
 LuaJIT and Torch 7.0
 nngraph
 Xitari (fork of the Arcade Learning Environment (Bellemare et al., 2013))
 AleWrap (a Lua interface to Xitari)
 An install script for these dependencies is provided.
Asynchronous RL in TensorFlow + Keras + OpenAI's Gym
 https://github.com/coreylynch/async-rl
 tensorflow
 gym
 gym's atari environment (https://github.com/openai/gym#atari)
 skimage
 Keras

RESOURCES
The Arcade Learning Environment
 http://www.arcadelearningenvironment.org/
ALE (Visual Studio Version)
 https://github.com/mvacha/A.L.E.-0.4.4.-Visual-Studio

APT-GET INSTALL
 libtiff5-dev
 libjpeg8-dev
 zlib1g-dev
 liblcms2-dev
 libwebp-dev
 tcl8.6-dev
 tk8.5-dev
 python-tk
 cmake
 xvfb

DEEP NEURAL NETWORKS
 TensorFlow is a flexible deep learning framework
 Backpropagation and deep neural networks do much of the work; the reinforcement learning challenge is finding the right loss function to train on

HOW TO RUN AI AGENTS ON GAMES?
https://gym.openai.com/ OpenAI Gym
 Library of Environments
 Pong
 Breakout
 Cart-Pole
 Same API
 Provides way to share and compare results

HOW TO RUN AI AGENTS ON GAMES?
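https://gym.openai.com/
pip install -e '.[atari]'
import gym
env = gym.make('SpaceInvaders-v0')
obs = env.reset()                        # first observation (screen image)
env.render()
action = env.action_space.sample()       # random action; replace with the agent's choice
obs, reward, done, _ = env.step(action)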

OTHER OPTIONS
https://github.com/DanielSlater/PyGamePlayer
PyGame
 Thousands of games
 Easy to change game code
 PyGamePlayer
 Half Pong

python async_dqn.py --experiment breakout --game "Breakout-v0" --num_concurrent 8
Checkpoints
/tmp/checkpoints/
TensorBoard Summary
tensorboard --logdir /tmp/summaries/breakout
"created":1485854183,
"episode_types":["t"],
"episode_lengths":[1717],
"object":"episode_batch",
"initial_reset_timestamps":[1485853848.3293480873],
"episode_rewards":[62.0],
"data_sources":[0],
"seeds":[],
"main_seeds":[],
"timestamps":[1485853853.9296009541],
"env_id":"Breakout-v0",
"initial_reset_timestamp":1485853848.3293480873,
"id":"eb_taFBJqLFThuZ5jBwO0NFTQ"

ALE GRAYSCALE CONVERSION METHOD
RGB images grayscale conversion
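A grayscale conversion of an RGB frame can be sketched as a weighted sum of the three channels. The weights below are the common ITU-R BT.601 luma coefficients, which is an assumption here: the slide does not state which weights ALE or the talk's code uses. `rgb_to_gray` is a hypothetical helper name.

```python
def rgb_to_gray(frame):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    # Assumed weights: ITU-R BT.601 luma coefficients.
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in frame
    ]


# A 2 x 2 test frame: white, black, pure red, pure blue.
frame = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
gray = rgb_to_gray(frame)
print(gray)
```

Collapsing three channels into one cuts the input size by a factor of three before the frames reach the network.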

SCREENSHOT
Frame skipping, taking the pixel-wise maximum over two consecutive frames
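One way to read this preprocessing step: the agent sees only every k-th frame, and each frame it sees is the pixel-wise maximum of the last two raw frames, so sprites that flicker on alternate frames remain visible. The sketch below assumes k = 4 and grayscale frames stored as nested lists; `max_pool_frames` and `skip_frames` are hypothetical names.

```python
def max_pool_frames(frame_a, frame_b):
    """Pixel-wise maximum of two equally sized grayscale frames."""
    return [
        [max(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def skip_frames(frames, skip=4):
    """Emit one observation per `skip` raw frames (skip >= 2).

    Each observation is the maximum of the last two frames in its group.
    """
    observations = []
    for end in range(skip, len(frames) + 1, skip):
        observations.append(max_pool_frames(frames[end - 2], frames[end - 1]))
    return observations


# Eight 1 x 2 frames in which a sprite flickers between the two pixels.
raw = [[[9, 0]] if i % 2 == 0 else [[0, 9]] for i in range(8)]
obs = skip_frames(raw, skip=4)
print(obs)
```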

100-EPISODE (2 HOURS) AVERAGE REWARD WAS 68.97
Training episode batch video (mp4)
Visualizing training with TensorBoard

VISUALIZING TRAINING WITH TENSORBOARD
Panels: Episode Reward; Max Q Value

MARKOV DECISION PROCESS
 A policy for choosing these actions
 In general the environment is stochastic
 The next state is therefore also random
 MDP <S, A, P, R, 𝛾>
 S: set of states
 A: set of actions
 P(s, a, s'): probability of transition from s to s' under action a
 R(s): reward function
 𝛾: discount factor
 Trace: {<s0, a0, r0>, ..., <sn, an, rn>}
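The MDP tuple can be written down directly as data, and a trace sampled from it. The two-state MDP below is made up for illustration; the state names, action names, probabilities, and the helper `sample_trace` are all assumptions.

```python
import random

# Hypothetical two-state MDP <S, A, P, R, gamma>.
S = ["left", "right"]
A = ["stay", "move"]
# P[s][a] maps each next state s' to its transition probability.
P = {
    "left": {"stay": {"left": 0.9, "right": 0.1},
             "move": {"left": 0.2, "right": 0.8}},
    "right": {"stay": {"left": 0.1, "right": 0.9},
              "move": {"left": 0.8, "right": 0.2}},
}
R = {"left": 0.0, "right": 1.0}   # reward function R(s)
gamma = 0.9                       # discount factor, used when summing rewards


def sample_trace(policy, start, n_steps, rng):
    """Return a trace [(s0, a0, r0), ..., (sn, an, rn)] under the given policy."""
    trace, s = [], start
    for _ in range(n_steps):
        a = policy(s)
        trace.append((s, a, R[s]))
        next_states = list(P[s][a])
        weights = [P[s][a][s2] for s2 in next_states]
        s = rng.choices(next_states, weights=weights)[0]   # stochastic transition
    return trace


trace = sample_trace(lambda s: "move" if s == "left" else "stay",
                     "left", 5, random.Random(0))
print(trace)
```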

Convolutional networks: network architecture

REINFORCEMENT LEARNING
3 categories of reinforcement learning
 Value learning: Q-learning
 Given a state and a set of possible actions, choose the action with the best reward
 Policy learning: policy gradients
 Use gradients to find the best policy
 Model learning
 Learn the transitions between different states
 Min-Max
 Monte-Carlo sampling
Definitions
 Return: total discounted reward: R = r0 + 𝛾 r1 + 𝛾^2 r2 + ...
 Policy: the agent's behavior
 Deterministic policy: π(s) = a
 Stochastic policy: π(a | s) = P[A_t = a | S_t = s]
 Value function: expected return starting from state s:
 State-value function: V^π(s) = E_π[R | S_t = s]
 Action-value function: Q^π(s, a) = E_π[R | S_t = s, A_t = a]
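As a worked example of these definitions, the sketch below computes a discounted return and a Monte Carlo estimate of a state value as the average return over sampled episodes. It assumes the usual definition of the return, R = r0 + 𝛾 r1 + 𝛾^2 r2 + ...; the function names and reward sequences are made up.

```python
def discounted_return(rewards, gamma):
    """R = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the end."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def monte_carlo_value(episodes, gamma):
    """Estimate V(s) as the mean return of episodes that all start in state s."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))                      # 1 + 0 + 0.25 * 2
print(monte_carlo_value([[1.0, 0.0, 2.0], [0.0, 0.0, 0.0]], gamma=0.5))
```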

LEARNING
Deep Q Learning
 Model-free, off-policy technique to learn the optimal Q(s, a):
 Q_{i+1}(s, a) ← Q_i(s, a) + 𝛼 (R + 𝛾 max_{a'} Q_i(s', a') - Q_i(s, a))
 The optimal policy is then π(s) = argmax_{a'} Q(s, a')
 Requires exploration (ε-greedy) to try the various transitions out of each state.
 Take a random action with probability ε; start with ε high and decay it to a low value as training progresses.
 Deep Q Learning: approximate Q(s, a) with a neural network: Q(s, a, 𝜃)
 Do stochastic gradient descent using the loss
 L(𝜃_i) = MSE_{s,a}(Q(s, a, 𝜃_i), r + 𝛾 max_{a'} Q(s', a', 𝜃_{i-1}))
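The Q-learning update and the decaying ε-greedy exploration can be tried on a problem small enough for a table. This is a tabular sketch, not the deep network version: the 4-state chain environment, the hyperparameters, and the linear ε schedule are all assumptions made for illustration.

```python
import random


def step(state, action):
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left.

    Reaching state 3 gives reward 1 and ends the episode.
    """
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward, next_state == 3


def q_learning(episodes=500, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(4)]               # Q[state][action]
    for episode in range(episodes):
        epsilon = max(0.05, 1.0 - episode / 250)     # start high, decay to a low value
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(2)            # explore: random action
            else:
                action = max((0, 1), key=lambda a: Q[state][a])   # exploit
            next_state, reward, done = step(state, action)
            # No bootstrap term when the next state is terminal.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
greedy_policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(3)]
print(greedy_policy)
```

In the deep variant the table `Q` is replaced by a network Q(s, a, 𝜃) and the same target feeds the squared-error loss.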
Policy Gradient
 Given a policy π_𝜃(a | s), find the 𝜃 that maximizes the expected return:
 J(𝜃) = ∑_s d^π(s) V^π(s)
 In deep RL, we approximate π_𝜃(a | s) with a neural network.
 Usually with a softmax layer on top to estimate the probability of each action.
 We can estimate J(𝜃) from samples of observed behavior: ∑_{k=0..T} p_𝜃(𝜏_k | π) R(𝜏_k)
 Do stochastic gradient ascent using the update:
 𝜃_{i+1} = 𝜃_i + 𝛼 (1/T) ∑_{k=0..T} ∇ log p_𝜃(𝜏_k | π) R(𝜏_k)
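The update rule can be illustrated on the simplest possible case, a one-step problem where each trajectory is a single action. The sketch below uses a softmax policy with one logit per action and updates 𝜃 after every sample rather than averaging over T trajectories; the reward values, step size, and function names are assumptions.

```python
import math
import random


def softmax(logits):
    peak = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, alpha=0.1, seed=0):
    """Policy-gradient ascent on a hypothetical one-step problem (a bandit)."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)                 # one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        ret = arm_rewards[action]                    # R(tau) for this one-step trajectory
        # For a softmax policy: d/d theta_j log pi(action) = 1[j == action] - pi(j).
        for j in range(len(theta)):
            grad_log = (1.0 if j == action else 0.0) - probs[j]
            theta[j] += alpha * grad_log * ret
    return softmax(theta)


probs = reinforce_bandit([0.0, 1.0, 0.2])
print(probs)
```

With a neural network policy, the per-logit gradient above is what backpropagation computes through the softmax layer.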

ASYNC ADVANTAGE ACTOR-CRITIC (A3C)
 Asynchronous: uses multiple instances of the environment and the network in parallel
 Actor-Critic: uses both a policy (the actor) and an estimate of the value function (the critic)
 Advantage: estimates how much better or worse the outcome was than expected
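The advantage can be computed as the gap between the discounted return actually observed and the critic's value estimate for the same state. The sketch below assumes a short rollout with a bootstrap value for the state after the last step; the function name and the numbers are made up.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma):
    """A_t = R_t - V(s_t), where R_t is the discounted return from step t.

    `bootstrap_value` is the critic's estimate for the state after the last
    step (0.0 if the episode ended there).
    """
    advantages = []
    ret = bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret            # return observed from this step on
        advantages.append(ret - v)       # how much better than the critic expected
    advantages.reverse()
    return advantages


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]                 # critic's V(s_t) for each visited state
adv = n_step_advantages(rewards, values, bootstrap_value=0.0, gamma=0.5)
print(adv)
```

A positive advantage raises the probability of the action taken and a negative one lowers it.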

TENSORFLOW-RL/EXAMPLES/ATARI-RL.PY

ACTING
Environment Random Agent

Q-NETWORK
Q-network Optimization

Q-NETWORK
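Q-network Layer
 Convolutional Layer
 16 filters of 8 x 8 with a stride of 4 x 4, followed by a ReLU nonlinearity
 32 filters of 4 x 4 with a stride of 2 x 2, followed by a ReLU nonlinearity
 Flatten
 Flattens the response into a one-dimensional vector
 Fully-Connected Layer
 256 units, followed by a ReLU nonlinearity
 num_actions units with a linear activation, giving one score (called the Q value) per action
 Pooling Layer
 none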
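The layer list is enough to work out the tensor shapes and the parameter count by hand. The sketch below does the arithmetic, assuming an 84 x 84 input with 4 stacked frames, no padding, and 4 actions; none of these three values appear on the slide, and the function names are hypothetical.

```python
def conv_output_size(input_size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (input_size - kernel) // stride + 1


def q_network_shapes(input_size=84, channels=4, num_actions=4):
    size1 = conv_output_size(input_size, kernel=8, stride=4)   # 16 filters of 8 x 8
    size2 = conv_output_size(size1, kernel=4, stride=2)        # 32 filters of 4 x 4
    flat = size2 * size2 * 32                                   # flatten
    params = (
        (8 * 8 * channels + 1) * 16      # conv 1 weights and biases
        + (4 * 4 * 16 + 1) * 32          # conv 2 weights and biases
        + (flat + 1) * 256               # fully connected hidden layer
        + (256 + 1) * num_actions        # linear output, one Q value per action
    )
    return {
        "conv1": (size1, size1, 16),
        "conv2": (size2, size2, 32),
        "flatten": flat,
        "params": params,
    }


shapes = q_network_shapes()
print(shapes)
```

Under these assumptions almost all of the parameters sit in the 256-unit fully connected layer.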

Q-NETWORK
Q-network Monitored Training Session

POLICY NETWORK
Policy Network Optimization

POLICY AND VALUE NETWORKS
Networks optimization

PROBLEM
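temporal credit assignment
 Assigning credit over time
 Earlier actions affect the reward obtained now
 How much influence each action in the sequence had
 experience replay
 All experiences <S, A, R, S'> are stored in a table
balancing exploration and exploitation
 Balancing the two kinds of action
 Exploit the existing policy
 Or explore other, possibly better policies
 ε-greedy exploration
 Act greedily according to the highest Q value
 With some probability, choose a random action instead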
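Both ideas are short to write down. The sketch below shows a fixed-capacity replay table sampled uniformly at random, and an ε-greedy action choice; the class name, function name, capacity, and the stored `done` flag are assumptions added for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size table of experiences (state, action, reward, next_state, done)."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the highest Q value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=random.Random(0))
print(len(buffer), batch, action)
```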
