Reinforcement Learning
By Usman Qayyum
13 Nov 2018
Machine Learning Expert?
Supervised learning suffers from the underlying human bias present in the data
Machine Learning
• Supervised Learning: learns from (example, class) pairs, e.g. Classification and Regression
• Reinforcement Learning: learns from (situation, reward) interactions, e.g. Q-learning, DQN, Policy Gradient, Actor-Critic
• Un-Supervised Learning: learns from examples alone, e.g. Clustering and Auto-Encoder
Human Learning (Trial & Error)
● A baby learns to walk by trial and error: each attempt either fails or achieves the goal, e.g. successfully reaching the couch.
Reinforcement Learning
● Trial & error learning
● Learning from interaction
● Learning what to do—how to map
situations to actions—so as to maximize a
numerical reward signal
How to Formulate an RL Problem
Environment: the physical world in which the agent operates
State: the current situation of the agent
Action: how the agent interacts with the environment
Reward: feedback from the environment
Policy: the method that maps the agent's state to actions
Value: the future reward an agent would receive by taking an action in a particular state
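To make these terms concrete, here is a minimal sketch of the agent-environment loop in Python, assuming a hypothetical env and policy with the classic Gym-style reset()/step() interface (not code from the slides):

# A minimal sketch of the loop described above.
def run_episode(env, policy, gamma=0.99):
    state = env.reset()                      # State: current situation of the agent
    done, ret, discount = False, 0.0, 1.0
    while not done:
        action = policy(state)               # Policy: maps the state to an action
        state, reward, done, info = env.step(action)   # Action applied, Reward returned
        ret += discount * reward             # Value: discounted future reward
        discount *= gamma
    return ret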
RL Applications (Games/Networking)
Video game:
• Objective: complete the game with the highest score
• State: raw pixel inputs of the game screen
• Action: game controls, e.g. Left, Right, Up, Down
• Reward: score increase/decrease at each time step

Board game:
• Objective: win the game
• State: position of all pieces
• Action: where to put the next piece down
• Reward: 1 if the game is won at the end, 0 otherwise

Networking:
• Objective: intelligent channel selection
• State: occupation of each channel in the current time slot
• Action: set the channel to be used for the next time slot
• Reward: +1 if there is no collision with an interferer, -1 otherwise
Markov Decision Process
• MDP is used to describe an environment for reinforcement learning
• Almost all RL problems can be formalized as MDPs
The Markov property states that "the future is independent of the past given the present":
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
(Figure: a Markov chain with its transition matrix and the corresponding Markov reward process.)
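As an illustration of the Markov property and a transition matrix, a small Markov chain can be simulated with NumPy (the states and probabilities below are made up, not from the slides):

import numpy as np

# Each row i of P gives the distribution of the next state given the current
# state i, i.e. P[i, j] = Pr[S_{t+1} = j | S_t = i].
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.4, 0.6]])

rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(10):
    state = rng.choice(3, p=P[state])   # the next state depends only on the current state
    trajectory.append(int(state))
print(trajectory)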
Model / Model-Free Learning
Environment (Taxi Game)
Representations
WALL: cannot be passed through; the taxi stays in the same position
Yellow: the taxi's current location
Blue: the pick-up location
Purple: the drop-off location
Green: the taxi turns green once the passenger is on board
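For reference, the environment can be created through OpenAI Gym roughly as follows. This is a sketch assuming the classic Gym API of that era; the environment id may be "Taxi-v2" or "Taxi-v3" depending on the Gym version.

import gym

env = gym.make("Taxi-v3")          # "Taxi-v2" on 2018-era Gym releases
state = env.reset()
env.render()                       # prints the grid with the colour coding above
print(env.observation_space.n)     # 500 states
print(env.action_space.n)          # 6 actions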
Q Learning …
● A Q-table is just a fancy name for a simple lookup table in which we store the maximum expected future reward for each action at each state.
But the questions are:
How do we calculate the values of the Q-table?
Are the values available or predefined?
States: 500
Actions (0-5):
0: move south
1: move north
2: move east
3: move west
4: pick up passenger
5: drop off passenger
Rewards:
+20: successfully pick up a passenger and drop them off at the desired location
-1: for each step taken
-10: every time a passenger is picked up or dropped off incorrectly
Q Learning …
Step 1: when the episode initially starts, every Q-value is 0.
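With NumPy and the Taxi environment created earlier, this initialisation is one line (a sketch, not the slide's exact code):

import numpy as np

# One row per state, one column per action (500 x 6 for Taxi); all zeros at the start.
q_table = np.zeros((env.observation_space.n, env.action_space.n))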
Q Learning …
Steps 2 & 3: choose and perform an action
In the beginning, the agent explores the environment by choosing actions at random.
As it learns more about the environment, it gradually starts to exploit what it already knows.
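A common way to implement this exploration/exploitation trade-off is an epsilon-greedy rule, sketched below (the function name and the epsilon value are illustrative, not from the slides; env and q_table come from the snippets above):

import random
import numpy as np

def choose_action(q_table, state, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if random.random() < epsilon:
        return env.action_space.sample()       # explore: random action
    return int(np.argmax(q_table[state]))      # exploit: action with the highest Q-value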
Q Learning …
Steps 4 & 5: measure the reward and update the Q-table
The Q-function is updated with the Bellman equation; it takes two inputs, the state (s) and the action (a):
Q(s, a) ← Q(s, a) + α [ R + γ max_a' Q(s', a') - Q(s, a) ]
where α is the learning rate and γ is the discount factor applied to the future reward.
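In code, this update amounts to a couple of lines per transition (a sketch of standard tabular Q-learning, with illustrative values for alpha and gamma):

import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Move Q(s, a) towards the Bellman target r + gamma * max_a' Q(s', a').
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])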
Q-Learning to DQN
Google DeepMind (Deep Q-Network)
"Human-level control through deep reinforcement learning", Nature, 2015
Gym
A library that simulates a large number of reinforcement learning environments, including Atari games. It was created to address:
• the lack of standardization of the environments used in publications, and
• the need for better benchmarks.
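All Gym environments share the same reset()/step() interface. A random agent on an Atari game looks roughly like this (assuming a 2018-era Gym with the Atari extras installed; exact environment ids vary between versions):

import gym

env = gym.make("Breakout-v0")
obs = env.reset()
done = False
while not done:
    # Random agent: sample an action, observe the next frame and the reward.
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()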
Example: Taxi Game Problem (OpenAI Gym)
Example-1
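The slide shows a code screenshot that is not reproduced here; a minimal Taxi Q-learning training loop in the same spirit, combining the pieces above, might look like this (hyper-parameter values are illustrative):

import random
import gym
import numpy as np

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(10_000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, done, info = env.step(action)

        # Q-learning update (Bellman equation).
        td_target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state

After enough episodes, acting greedily with respect to q_table typically solves the pick-up-and-drop-off task in a handful of steps.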
Example-2
Example-2 …
Deep Q-Network
Human-level control through deep reinforcement learning – Nature Vol 518, Feb 26, 2015
By Usman Qayyum
15 Nov 2018
Model-Free RL (Recap)
● Policy-based RL
○ Search directly for the optimal policy π*
○ This is the policy that achieves the maximum future reward
● Value-based RL
○ Estimate the optimal value function Q*(s, a)
○ This is the maximum value achievable under any policy
Q-Learning to DQN (Value-based RL)
The Q-table is like a "cheat sheet" that helps us find the maximum expected future reward of an action, given the current state.
• This is a good strategy; however, it does not scale to large state spaces.
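A quick back-of-the-envelope calculation shows why: even with the 84x84 grayscale preprocessing used later in these slides, the number of distinct single frames alone is astronomically larger than any table could hold.

import math

# 84 x 84 pixels, 256 intensity levels each: 256**(84*84) possible frames.
log10_states = 84 * 84 * math.log10(256)
print(f"about 10^{log10_states:.0f} distinct frames")   # roughly 10^16993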
Playing Atari with Deep RL (Nature, 2015)
● Played seven Atari 2600 games
● Beat previous ML approaches on six
● Beat a human expert on three
● Aim: create a single neural-network agent that is able to successfully learn to play as many of the games as possible
● Learns strictly from experience, with no pre-training
● Inputs: game screen + score
● No game-specific tuning
What’s Next
Atari
● Rules of the game are unknown
● Learn directly from interactive game play
● Pick actions on the joystick; observe pixels and the score
Preprocessing & Temporal limitation
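The slide's figure is not reproduced here. In the Nature paper the preprocessing works as follows: frames are converted to grayscale and downsampled to 84x84, and, because a single frame cannot convey motion (the temporal limitation), the last four processed frames are stacked together. A sketch using OpenCV and NumPy (the helper names are illustrative):

import cv2
import numpy as np
from collections import deque

def preprocess(frame):
    # Grayscale and downsample a raw Atari frame to 84x84, as in the Nature paper.
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def stack_frames(stack, new_frame):
    # Keep the last four preprocessed frames so the network can perceive motion.
    stack.append(preprocess(new_frame))
    while len(stack) < 4:                    # pad at the start of an episode
        stack.append(stack[-1])
    return np.stack(stack, axis=-1)          # shape (84, 84, 4)

frame_stack = deque(maxlen=4)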
Convolution Layers / Fully Connected
• Frames are processed by three convolution layers.
• These layers let the network exploit spatial relationships in the images.
• In addition, because frames are stacked together, the network can exploit temporal properties across those frames.
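For concreteness, the network described in the Nature paper (three convolution layers followed by fully connected layers) can be sketched in Keras; the framework choice here is ours, not the slides'.

from tensorflow.keras import layers, models

def build_dqn(n_actions):
    # Layer sizes as reported in the Nature DQN paper.
    return models.Sequential([
        layers.Input(shape=(84, 84, 4)),                    # 4 stacked 84x84 frames
        layers.Conv2D(32, 8, strides=4, activation="relu"),
        layers.Conv2D(64, 4, strides=2, activation="relu"),
        layers.Conv2D(64, 3, strides=1, activation="relu"),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_actions),                            # one Q-value per action
    ])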
Experience Replay
Experience replay helps us handle two things:

Avoid forgetting previous experiences: the weights are highly variable because there is strong correlation between successive actions and states, and new experiences tend to overwrite what was learned from older ones.
Solution: create a "replay buffer" that stores experience tuples while interacting with the environment; we then sample a small batch of tuples to feed our neural network.

Reduce correlations between experiences: every action affects the next state, so interaction produces a sequence of experience tuples that can be highly correlated.
Solution: by sampling from the replay buffer at random, we can break this correlation. This prevents action values from oscillating or diverging catastrophically.
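A replay buffer of this kind is only a few lines of Python (a sketch; the capacity and batch size are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest experiences are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive experiences.
        return random.sample(self.buffer, batch_size)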
Clipping Rewards
Each game has a different score scale. For example, in Pong a player gets +1 point for winning a rally and -1 for losing it, whereas in Space Invaders a player gets 10 to 30 points for defeating invaders. This difference makes training unstable.
The reward-clipping technique therefore clips the scores: all positive rewards are set to +1 and all negative rewards to -1.
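In code, reward clipping is a one-liner (a sketch):

import numpy as np

def clip_reward(reward):
    # Map every positive score change to +1, every negative one to -1, and 0 to 0.
    return float(np.sign(reward))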
DQN Algorithm
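The slide's figure is not reproduced here. The core of the algorithm, the minibatch update against a frozen target network introduced in the Nature paper, can be sketched as follows (TensorFlow/Keras is our choice, and the optimizer differs from the paper's RMSProp). The remaining steps are the interaction pieces already sketched above: epsilon-greedy action selection, storing clipped-reward transitions in the replay buffer, and copying the online network's weights into the target network every fixed number of steps.

import numpy as np
import tensorflow as tf

def dqn_train_step(q_net, target_net, batch, optimizer, gamma=0.99):
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    states = states.astype(np.float32) / 255.0          # normalise pixel values
    next_states = next_states.astype(np.float32) / 255.0

    # Targets: y = r for terminal transitions, else y = r + gamma * max_a' Q_target(s', a').
    next_q = target_net.predict(next_states, verbose=0).max(axis=1)
    targets = (rewards + gamma * next_q * (1.0 - dones)).astype(np.float32)

    with tf.GradientTape() as tape:
        q_values = q_net(states)                                  # shape (batch, n_actions)
        mask = tf.one_hot(actions, q_values.shape[-1])
        chosen_q = tf.reduce_sum(q_values * mask, axis=1)         # Q(s, a) for the taken actions
        loss = tf.reduce_mean(tf.square(targets - chosen_q))      # squared TD error
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)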
Performance
Recent graph from Google DeepMind, 2018 (current trend in RL for games)
Naïve DQN vs. replay-buffer-based DQN
Strengths and Weaknesses
● Good at
‣ Quick-moving, complex, short-horizon games
‣ Semi-independent trials within the game
‣ Negative feedback on failure
● Bad at
‣ Long-horizon games that don't converge
‣ Any "walking around" game
‣ Montezuma's Revenge
Worldly knowledge helps humans play these games relatively easily.
Example Code
● DQN with Atari Game
○ Colab Jupyter notebooks
Reference
● Rich Sutton, Reinforcement Learning: An Introduction, 2017
● Deep Reinforcement Learning: An Overview, 2017, https://arxiv.org/pdf/1701.07274.pdf
● UCL course on Reinforcement Learning: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
● CS231n, Reinforcement Learning, Lecture 14, 2017, http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
● Thomas Simonini, Medium post "An Introduction to Reinforcement Learning", https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419
● Arthur Juliani, Medium post "Simple Reinforcement Learning in Tensorflow", https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149