Deep Reinforcement Learning
from scratch
Jie-Han Chen
NetDB, National Cheng Kung University
1
The content and images in these slides were borrowed from:
1. Rich Sutton’s textbook
2. David Silver’s Reinforcement Learning course at UCL
3. Sergey Levine’s Deep Reinforcement Learning course at UC Berkeley
4. Deep Reinforcement Learning and Control at CMU (CMU 10703)
2
Disclaimer
Outline
3
1. Introduction to RL and MDP
2. Q-Learning
3. Deep Q Network
4. Discussion
Introduction - RL
4
Figure from Sutton & Barto, RL textbook
Reinforcement Learning vs. Supervised Learning
Supervised Learning:
Input data are independent: the current
output does not affect the next input data.
5
Reinforcement Learning vs. Supervised Learning
Reinforcement Learning:
The agent’s actions affect the data it
will receive in the future. (from CMU
10703)
6
Figure from Wikipedia, made by waldoalvarez
When do we use
Reinforcement Learning?
7
If the problem can be modeled as an MDP,
we can try RL to solve it!
8
Types of RL tasks
1. Episodic task: the task terminates after a finite number of steps.
e.g., games, chess
2. Continuing task: the task never terminates.
9
Markov Decision Process
Defined by:
1. S: set of states
2. A: set of actions
3. R: reward model R(s)/ R(s, a)/ R(s, a, s’)
4. P: dynamics model and its transition probabilities P(s’ | s, a)
5. γ: the discount factor
10
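To make the five components above concrete, here is a toy two-state MDP written out as plain Python data structures. This is only an illustrative sketch (the states, actions, and numbers are made up), not an example from the slides.

```python
# A made-up 2-state MDP, written as plain Python dicts purely to
# illustrate the five components (S, A, R, P, gamma).
states = ["s0", "s1"]                              # S: set of states
actions = ["left", "right"]                        # A: set of actions

# R(s, a): expected immediate reward
rewards = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
           ("s1", "left"): 0.0, ("s1", "right"): 5.0}

# P(s' | s, a): transition probabilities over next states
transitions = {("s0", "left"):  {"s0": 1.0},
               ("s0", "right"): {"s1": 1.0},
               ("s1", "left"):  {"s0": 1.0},
               ("s1", "right"): {"s1": 1.0}}

gamma = 0.9                                        # discount factor
```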
Define agent-environment boundary
Before defining the set of states, we should define the
boundary between the agent and the environment.
According to Richard Sutton’s textbook:
1. “The agent-environment boundary represents the
limit of the agent’s absolute control, not of its
knowledge.”
2. “The general rule we follow is that anything
that cannot be changed arbitrarily by the agent is
considered to be outside of it and thus part of its
environment.”
11
Markov Property
● A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
● A state should summarize past sensations so as to retain all “essential”
information.
● We should be able to throw away the history once the state is known.
13
from CMU 10703
Define State (Observation)
14
Atari 2600: Space Invaders / Go
Define Action
1. Discrete Action Space
2. Continuous Action Space
15
Atari 2600: Breakout / Robotic Arm
Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s)/ R(s, a)/ R(s, a, s’)
4. P: dynamics model and its transition probabilities P(s’ | s, a)
5. γ: the discount factor
16
Define Rewards
Rewards specify WHAT the agent needs to achieve, NOT HOW to achieve it.
17
S: start state
G: Goal
Define Rewards
Rewards specify WHAT the agent needs to achieve, NOT HOW to achieve it.
18
Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s)/ R(s, a)/ R(s, a, s’) ✓
4. P: dynamics model and its transition probabilities P(s’ | s, a)
5. γ: the discount factor
19
Markov Decision Process
Definition: A policy π is a distribution over actions given states, π(a|s) = P[A_t = a | S_t = s].
MDP policies depend only on the current state (they are stationary, i.e. time-independent).
20
Markov Decision Process
The objective in RL is to maximize the long-term future reward.
Definition: The return G_t is the total discounted reward from timestep t:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
In episodic tasks, we can consider the undiscounted future reward (γ = 1).
21
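As a small illustration (my own example, not from the slides), the discounted return of a finite reward sequence can be computed by working backwards through the rewards:

```python
# Discounted return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
# computed for an illustrative reward sequence.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):      # G_t = r + gamma * G_{t+1}, computed backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # = 1 + 0.9*0 + 0.81*2 = 2.62
```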
Markov Decision Process
Definition: The state-value function v_π(s) of an MDP is the expected return
starting from state s, and then following policy π: v_π(s) = E_π[ G_t | S_t = s ]
Definition: The action-value function q_π(s, a) is the expected return starting from
state s, taking action a, and then following policy π: q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
22
Bellman Expectation Equation
The state-value function can be decomposed into the immediate reward plus the
discounted value of the successor state: v_π(s) = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]
The action-value function can similarly be decomposed:
q_π(s, a) = E_π[ R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]
23
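The Bellman expectation equation leads directly to iterative policy evaluation: repeatedly sweep over the states and replace each value with its one-step look-ahead. Below is a minimal sketch, assuming the toy-MDP dictionaries from the earlier example and a uniformly random policy (both are my own illustrative assumptions):

```python
# Iterative policy evaluation using the Bellman expectation equation:
#   v(s) <- sum_a pi(a|s) [ R(s, a) + gamma * sum_s' P(s'|s, a) v(s') ]
# Assumes `states`, `actions`, `rewards`, `transitions`, `gamma` shaped
# like the toy-MDP dicts above, and a uniform random policy pi(a|s).
def policy_evaluation(states, actions, rewards, transitions, gamma, sweeps=100):
    v = {s: 0.0 for s in states}
    pi = 1.0 / len(actions)                        # uniform random policy
    for _ in range(sweeps):
        for s in states:
            v[s] = sum(pi * (rewards[(s, a)] +
                             gamma * sum(p * v[s2]
                                         for s2, p in transitions[(s, a)].items()))
                       for a in actions)
    return v
```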
Optimal Value Functions
Definition: The optimal state-value function is the maximum value function
over all policies: v*(s) = max_π v_π(s)
Definition: The optimal action-value function is the maximum
action-value function over all policies: q*(s, a) = max_π q_π(s, a)
24
We can use a backup diagram to explain the relationship between v_π and q_π,
and how each can be computed from the other.
Backup Diagram
25
We can use a backup diagram to explain the relationship between v_π and q_π,
and how each can be computed from the other.
Backup Diagram
26
Backup Diagram
27
Backup Diagram
28
Optimal Policy
Define a partial ordering over policies: π ≥ π′ if v_π(s) ≥ v_π′(s) for all states s.
Theorem: For any Markov Decision Process
1. There exists an optimal policy π* that is better than or equal to all other
policies: π* ≥ π, ∀π
2. All optimal policies achieve the optimal value function: v_{π*}(s) = v*(s)
3. All optimal policies achieve the optimal action-value function: q_{π*}(s, a) = q*(s, a)
29
How to get Optimal Policies?
An optimal policy can be found by maximizing over q*(s, a):
π*(a|s) = 1 if a = argmax_a q*(s, a), and 0 otherwise.
There is always a deterministic optimal policy for any MDP.
If we know q*(s, a), we immediately have the optimal policy.
30
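A minimal sketch (illustrative only, assuming a tabular Q stored as a dict keyed by (state, action)) of extracting the deterministic greedy policy from q*:

```python
# pi*(s) = argmax_a Q(s, a): pick the action with the largest Q-value in each state.
def greedy_policy(Q, states, actions):
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```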
Optimal action-value function in MDP
q*(s, a) = E[ R_{t+1} + γ max_{a′} q*(S_{t+1}, a′) | S_t = s, A_t = a ]  (Bellman optimality equation)
31
Solving a Markov Decision Process
● Find the optimal policy
● Prediction: for a given policy, estimate the value functions of states and state-action
pairs.
● Control: estimate the value functions of states and state-action pairs for the
optimal policy.
32
Solving the Bellman Optimality Equation
Solving this equation directly requires the following:
1. accurate knowledge of the environment’s dynamics
2. enough space and time to do the computation
3. the Markov property
33
Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s)/ R(s, a)/ R(s, a, s’) ✓
4. P: dynamics model and its transition probabilities P(s’ | s, a)
5. γ: the discount factor
34
Outline
35
1. Introduction to RL and MDP
2. Q-Learning
3. Deep Q Network
4. Discussion
The category of RL
36
Value-based: select actions
according to a value function;
SGD on the Bellman error.
Policy-based: apply SGD directly to the
discounted expected return of a
parameterized policy.
Model-based: learn the model from interaction with the
environment, or simulate trajectories with the estimated environment
model. e.g., Dyna, MCTS
The category of RL
Model-based method: Learn the model of the MDP (transition probabilities and
rewards) and try to solve the MDP concurrently.
Model-free method: Learn how to act without explicitly learning the transition
probabilities.
37
Model-free Reinforcement Learning
Using sample backups: sample transition experience (s, a, r, s’):
38
Q-Learning
Proposed by Watkins, 1989
● A model-free algorithm
● Tabular method: uses a large table to store each action-value Q(s, a)
● Learns from one-step experience: (s, a, r, s’)
● Off-policy
● Online learning
Update the Q table: Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
39
Q-Learning
Learning by sample. Update Q toward the target (an estimate of the return), with step size α:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
Bootstrapping: using an estimate of the return as the target to update the old value
function.
40
Q-Learning
Update Q table:
41
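A tiny worked example of the update rule (all numbers are made up for illustration):

```python
# One Q-learning update with illustrative numbers:
# Q(s, a) = 1.0, observed r = 1.0, max_a' Q(s', a') = 2.0, alpha = 0.1, gamma = 0.9
q_sa, r, max_q_next, alpha, gamma = 1.0, 1.0, 2.0, 0.1, 0.9
target = r + gamma * max_q_next            # 1.0 + 0.9 * 2.0 = 2.8
q_sa = q_sa + alpha * (target - q_sa)      # 1.0 + 0.1 * (2.8 - 1.0) = 1.18
print(round(q_sa, 4))                      # 1.18
```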
Q-Learning vs. Optimal MDP
Q-Learning: Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
Optimal MDP (Bellman optimality): q*(s, a) = E[ R_{t+1} + γ max_{a′} q*(S_{t+1}, a′) | S_t = s, A_t = a ]
42
Off-policy
Off-policy: if the agent learns its policy from experience generated
by another policy (not the current policy), we call the algorithm off-policy.
Why is Q-Learning off-policy?
● given experience (s, a, r, s’) collected by the behavior policy (e.g. ε-greedy)
● update Q toward max_{a′} Q(s′, a′), i.e. toward the greedy target policy
43
On-policy
The agent can only learn its policy from experience generated by the
current policy. If the experience is not generated by the current policy, the learning
process is not guaranteed to converge.
44
But there is still a problem
If we act greedily (use the current best policy) at all times, most of the Q table
won’t be updated, and the policy we find will NOT be
OPTIMAL.
45
Exploration vs. Exploitation
Exploration: gather more information
Exploitation: make the best decision given current information
Q-Learning uses an ε-greedy strategy:
● With probability 1 − ε, select a = argmax_a Q(s, a)
● With probability ε, select a random action.
46
Q-Learning Algorithm
47
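The algorithm on this slide can be sketched in a few lines of Python. The sketch below assumes a small environment with a classic Gym-style reset()/step() interface, discrete states and actions, and a dict-based Q table; it is illustrative, not the exact code behind the later experiments.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(s, a)], defaults to 0
    actions = list(range(env.action_space.n))

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done, _ = env.step(a)
            # bootstrapped target: r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```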
Q-Learning Algorithm
The tabular method needs tremendous memory to store the action-values; when facing a
large/high-dimensional state space it suffers from the curse of dimensionality.
It can only be used in discrete-action tasks, because it selects the optimal action by a = argmax_a Q(s, a).
48
We need Function Approximator!
49
Function Approximator
There are many kinds of function approximators:
● Linear combination of features
● Neural networks
● Decision Tree
● Nearest neighbour
● Fourier/wavelet bases
● ...
50
Function Approximator
51
Deep Q Network
1. Proposed by V Mnih, K Kavukcuoglu, David Silver et al., DeepMind [1][2]
2. Uses a neural network as a non-linear function approximator
3. DQN = Q-Learning + Deep Network
4. Testbed: 49 Atari games
52
[1]V Mnih et al., Playing Atari with Deep Reinforcement Learning
[2]V Mnih et al., Human-level control through deep reinforcement learning (2015 Nature)
Deep Q Network - Define MDP
Is it an episodic task or a continuing task?
Is the action space discrete or continuous?
How to define state? Is it Markov?
How to define rewards?
53
Deep Q Network - Define MDP
1. The game is an episodic task
a. if a game has multiple lives, they treat losing a life as reaching a terminal state.
2. The action space is discrete
3. They use multiple frames as the state (4 frames here), because object motion
cannot be detected from only 1 frame; a 1-frame state is not Markov.
4. Clip the rewards to [-1, 1]
a. limits the scale of the error derivatives
b. makes it easier to use the same learning rate across multiple games
54
Deep Q Network - State in details
1. The original screen size is 210x160x3 (RGB)
2. They transform the original screen into grayscale (210x160x1)
3. Resize the screen to 84x84 to train faster
4. Stack the most recent 4 screen frames together as the state
55
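A minimal sketch of this preprocessing, assuming NumPy and OpenCV purely for illustration (the paper also takes a pixel-wise max over consecutive frames, which is omitted here):

```python
import numpy as np
import cv2

def preprocess(frame):                        # frame: (210, 160, 3) RGB, uint8
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                    # (210, 160)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)  # (84, 84)
    return small.astype(np.uint8)

def stack_state(last_four_frames):            # list of the 4 most recent frames
    return np.stack(last_four_frames, axis=0) # (4, 84, 84), the network input
```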
Deep Q Network - Architecture
56
DQN !
Deep Q Network - Architecture (2013)
1. 2 convolutional layers
a. 16 filters, 8x8, stride 4
b. 32 filters, 4x4, stride 2
2. 2 fully connected layers
a. flatten, then 256 neurons
b. 256 to # of actions (output layer)
57
3. Without:
a. pooling
b. batch normalization
c. dropout
Deep Q Network - Architecture (2015)
1. 3 convolutional layers
a. 32 filters, 8x8, stride 4
b. 64 filters, 4x4, stride 2
c. 64 filters, 3x3, stride 1
2. 2 fully connected layers
a. flatten, then 512 neurons
b. 512 to # of actions (output layer)
58
3. Again without:
a. pooling
b. batch normalization
c. dropout
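A minimal PyTorch sketch of the 2015 architecture listed above, assuming a 4x84x84 input and n_actions outputs; the layer sizes follow the slide, while other details (activation placement, initialization) are only illustrative:

```python
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 84x84 input -> 7x7 feature maps
            nn.Linear(512, n_actions),               # one Q-value per action
        )

    def forward(self, x):        # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(x))
```

Note there is no pooling, batch normalization, or dropout, matching the slide.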
Deep Q Network - preliminary summary
Currently, we have:
1. a Markov Decision Process
2. a non-linear function approximator to estimate Q(s, a)
With these we can already act in the environment (e.g. with a random policy).
But we want our agent to perform better and better.
59
Deep Q Network - Algorithm
In previous slides, we defined the optimal
action-value function of an MDP:
q*(s, a) = E[ R_{t+1} + γ max_{a′} q*(S_{t+1}, a′) | S_t = s, A_t = a ]
We can iteratively update the action-value (value iteration) by:
Q_{i+1}(s, a) = E[ r + γ max_{a′} Q_i(s′, a′) | s, a ]
As i → ∞, Q_i → q*, which means it
converges.
60
Deep Q Network - Algorithm
However, because we estimate the
action-value with a non-linear function
approximator, we cannot directly update the
action-value with the formula above.
It only works with a linear function approximator.
61
Deep Q Network - Algorithm
The good news: with a neural network, we can use Stochastic Gradient Descent
(SGD) to approach Q* (an estimate, not an exact solution).
In supervised learning, we often model this as a regression problem, e.g. minimizing
L_i(θ_i) = E[ ( y_i − Q(s, a; θ_i) )² ], with target y_i = r + γ max_{a′} Q(s′, a′; θ_{i−1}),
where θ_i are the weights of the neural network at
iteration i.
62
Deep Q Network - Algorithm
Recap the concept of a neural network in supervised learning: the target is fixed! A
fixed target does not need a gradient to flow through it. Here, however, the target changes with the network.
How do we fix it?
63
Deep Q Network - Algorithm
Use a separate network to stabilize SGD:
● evaluation network: estimates the current action-value Q(s, a; θ)
● target network: provides a fixed target Q(s′, a′; θ⁻)
We initialize the target network with the same weights as the evaluation network.
The gradient of the loss function:
∇_{θ_i} L_i(θ_i) = E[ ( r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ_i) ) ∇_{θ_i} Q(s, a; θ_i) ]
64
Deep Q Network - Algorithm
Update neural weights
65
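One update step with the separate target network can be sketched as follows. This assumes PyTorch, two copies of the network above (q_net and target_net), an optimizer over q_net's parameters, and a mini-batch of tensors; it is a sketch, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch                              # tensors from the replay buffer

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():                                      # no gradient through the target
        max_q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * max_q_next         # r + gamma * max_a' Q(s', a'; theta-)

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps, copy the evaluation network into the target network:
# target_net.load_state_dict(q_net.state_dict())
```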
Deep Q Network - Algorithm
We use online learning in DQN, just like Q-learning:
step 1: observe the environment, get an observation
step 2: take an action according to the current observation
(steps 1-2 are called sampling: they produce the experience (s, a, r, s’))
step 3: update the neural weights
66
Wait, there still exists another problem!
67
Correlation
68
Deep Q Network - Algorithm
There is still another problem: consecutive samples are correlated.
They use experience replay to solve it!
69
Deep Q Network - Algorithm
Experience replay: as the agent interacts with the environment with its behavior policy,
it stores the transition experience (s, a, r, s’) in a replay buffer.
When learning with SGD, the agent samples mini-batches of experience from the replay buffer,
learning batch by batch.
70
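A minimal sketch of a replay buffer (the capacity and batch size below are illustrative; the paper stored on the order of 1M transitions and sampled mini-batches of 32):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform sampling breaks correlation
        return tuple(zip(*batch))                        # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```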
71
Experiment settings
SGD optimizer: RMSProp
Learning rate: 2.5e-4 (0.00025)
Batch size: 32
Loss function: MSE loss, clipped within [-1, 1]
Decay epsilon (exploration rate) from 1.0 to 0.1 over the first 1M steps
72
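The epsilon schedule above can be written as a small helper; this is just a sketch of the linear decay described on the slide:

```python
def epsilon(step, start=1.0, end=0.1, decay_steps=1_000_000):
    # Linearly anneal from `start` to `end` over the first `decay_steps` steps,
    # then stay at `end`.
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```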
Deep Q Network - Result
The human performance is the average reward
achieved from around 20 episodes of each game
lasting a maximum of 5 min each, following around
2 h of practice playing each game.
73
You can see the figure at p.3:
https://storage.googleapis.com/deepmind-media/dqn/DQNNat
urePaper.pdf
Experiments on Space Invaders (Atari2600)
74
Space Invaders
1. We have 3 lives (episodic task)
2. We also have 3 shields
3. Need to beat all the invaders
4. The bullets blink at some frequency
75
Huber Loss
L_δ(x) = ½ x² if |x| ≤ δ, and δ (|x| − ½ δ) otherwise (quadratic for small errors, linear for large ones)
76
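For reference, a minimal sketch of the Huber loss with delta = 1 (equivalent to PyTorch's SmoothL1Loss with beta = 1.0), which is compared against the clipped MSE loss in the experiments below:

```python
def huber(error, delta=1.0):
    # Quadratic for small errors, linear for large ones, so the gradient
    # stays bounded by delta instead of growing with the error.
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)
```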
DQN: Huber Loss + Adam
Learning rate: 2.5e-4
77
DQN: MSE clamp loss + RMSProp
Learning rate: 2.5e-4
Total steps: 1e+7
78
DQN: MSE clamp loss + RMSProp
79
DQN: Paper settings
● MSE Loss, clamp loss within [-1, 1]
● using RMSProp as optimizer, LR=2.5e-4
80
750 !!
DQN: Huber Loss (without clamp loss) + RMSProp
81
DQN: MSE clamp loss + Adam
82
DQN: Huber Loss + Adam
83
Content not covered in these slides
The proof of convergence for linear & non-linear function
approximators; you can find it in Rich Sutton’s textbook, Ch. 9 - Ch. 11.
84