Reinforcement Learning
By Usman Qayyum
13 Nov 2018
Machine Learning Expert?
Supervised learning suffers from the underlying human bias present in the data
Machine Learning
• Supervised Learning: learns from (example, class) pairs, e.g. Classification and Regression
• Reinforcement Learning: learns from (situation, reward) interactions, e.g. Q-learning, DQN, Policy Gradient, Actor-Critic
• Un-Supervised Learning: learns from examples alone, e.g. Clustering and Auto-Encoder
Human Learning (Trial & Error)
● A baby learns to walk by trial and error: each attempt either fails or achieves the goal, e.g. successfully reaching the couch.
Reinforcement Learning
● Trial & error learning
● Learning from interaction
● Learning what to do—how to map
situations to actions—so as to maximize a
numerical reward signal
How to Formulate an RL Problem
Environment: the physical world in which the agent operates
State: the current situation of the agent
Action: how the agent interacts with the environment
Reward: feedback from the environment
Policy: the method that maps the agent's state to actions
Value: the future reward an agent would receive by taking an action in a particular state
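To make these terms concrete, here is a minimal sketch of the agent-environment loop in Python, assuming a hypothetical env and policy with the classic Gym-style reset()/step() interface (not code from the slides):

# A minimal sketch of the loop described above.
def run_episode(env, policy, gamma=0.99):
    state = env.reset()                      # State: current situation of the agent
    done, ret, discount = False, 0.0, 1.0
    while not done:
        action = policy(state)               # Policy: maps the state to an action
        state, reward, done, info = env.step(action)   # Action applied, Reward returned
        ret += discount * reward             # Value: discounted future reward
        discount *= gamma
    return ret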
RL Applications (Games/Networking)
Video game:
• Objective: complete the game with the highest score
• State: raw pixel inputs of the game screen
• Action: game controls, e.g. Left, Right, Up, Down
• Reward: score increase/decrease at each time step

Board game:
• Objective: win the game
• State: position of all pieces
• Action: where to put the next piece down
• Reward: 1 if the game is won at the end, 0 otherwise

Networking:
• Objective: intelligent channel selection
• State: occupation of each channel in the current time slot
• Action: set the channel to be used for the next time slot
• Reward: +1 if there is no collision with an interferer, -1 otherwise
Markov Decision Process
• MDP is used to describe an environment for reinforcement learning
• Almost all RL problems can be formalized as MDPs
The Markov property states that "the future is independent of the past given the present":
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
(Figure: a Markov chain with its transition matrix and the corresponding Markov reward process.)
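As an illustration of the Markov property and a transition matrix, a small Markov chain can be simulated with NumPy (the states and probabilities below are made up, not from the slides):

import numpy as np

# Each row i of P gives the distribution of the next state given the current
# state i, i.e. P[i, j] = Pr[S_{t+1} = j | S_t = i].
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.4, 0.6]])

rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(10):
    state = rng.choice(3, p=P[state])   # the next state depends only on the current state
    trajectory.append(int(state))
print(trajectory)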
Model / Model-Free Learning
Environment (Taxi Game)
Representations
WALL: cannot be passed through; the taxi stays in the same position
Yellow: the taxi's current location
Blue: the pick-up location
Purple: the drop-off location
Green: the taxi turns green once the passenger is on board
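For reference, the environment can be created through OpenAI Gym roughly as follows. This is a sketch assuming the classic Gym API of that era; the environment id may be "Taxi-v2" or "Taxi-v3" depending on the Gym version.

import gym

env = gym.make("Taxi-v3")          # "Taxi-v2" on 2018-era Gym releases
state = env.reset()
env.render()                       # prints the grid with the colour coding above
print(env.observation_space.n)     # 500 states
print(env.action_space.n)          # 6 actions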
Q Learning …
● A Q-table is just a fancy name for a simple lookup table in which we store the maximum expected future reward for each action at each state.
But the questions are:
How do we calculate the values of the Q-table?
Are the values available or predefined?
States: 500
Actions (0-5):
0: move south
1: move north
2: move east
3: move west
4: pick up passenger
5: drop off passenger
Rewards:
+20: successfully pick up a passenger and drop them off at the desired location
-1: for each step taken
-10: every time a passenger is picked up or dropped off incorrectly
Q Learning …
Step 1: when the episode initially starts, every Q-value is 0.
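With NumPy and the Taxi environment created earlier, this initialisation is one line (a sketch, not the slide's exact code):

import numpy as np

# One row per state, one column per action (500 x 6 for Taxi); all zeros at the start.
q_table = np.zeros((env.observation_space.n, env.action_space.n))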
Q Learning …
Steps 2 & 3: choose and perform an action
In the beginning, the agent explores the environment by choosing actions at random.
As it learns more about the environment, it gradually starts to exploit what it already knows.
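A common way to implement this exploration/exploitation trade-off is an epsilon-greedy rule, sketched below (the function name and the epsilon value are illustrative, not from the slides; env and q_table come from the snippets above):

import random
import numpy as np

def choose_action(q_table, state, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if random.random() < epsilon:
        return env.action_space.sample()       # explore: random action
    return int(np.argmax(q_table[state]))      # exploit: action with the highest Q-value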
Q Learning …
Steps 4 & 5: measure the reward and update the Q-table
The Q-function is updated with the Bellman equation; it takes two inputs, the state (s) and the action (a):
Q(s, a) ← Q(s, a) + α [ R + γ max_a' Q(s', a') - Q(s, a) ]
where α is the learning rate and γ is the discount factor applied to the future reward.
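In code, this update amounts to a couple of lines per transition (a sketch of standard tabular Q-learning, with illustrative values for alpha and gamma):

import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Move Q(s, a) towards the Bellman target r + gamma * max_a' Q(s', a').
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])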
Q-Learning to DQN
Google DeepMind (Deep Q-Network)
"Human-level control through deep reinforcement learning", Nature, 2015
Gym
A library that simulates a large number of reinforcement learning environments, including Atari games. It was created to address:
• the lack of standardization of the environments used in publications, and
• the need for better benchmarks.
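All Gym environments share the same reset()/step() interface. A random agent on an Atari game looks roughly like this (assuming a 2018-era Gym with the Atari extras installed; exact environment ids vary between versions):

import gym

env = gym.make("Breakout-v0")
obs = env.reset()
done = False
while not done:
    # Random agent: sample an action, observe the next frame and the reward.
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()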
Example: Taxi Game Problem (OpenAI Gym)
Example-1
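The slide shows a code screenshot that is not reproduced here; a minimal Taxi Q-learning training loop in the same spirit, combining the pieces above, might look like this (hyper-parameter values are illustrative):

import random
import gym
import numpy as np

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(10_000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, done, info = env.step(action)

        # Q-learning update (Bellman equation).
        td_target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state

After enough episodes, acting greedily with respect to q_table typically solves the pick-up-and-drop-off task in a handful of steps.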
Example-2
Example-2 …
Deep Q-Network
Human-level control through deep reinforcement learning – Nature Vol 518, Feb 26, 2015
By Usman Qayyum
15 Nov 2018
Model-Free RL (Recap)
● Policy-based RL
○ Search directly for the optimal policy π*
○ This is the policy that achieves the maximum future reward
● Value-based RL
○ Estimate the optimal value function Q*(s, a)
○ This is the maximum value achievable under any policy
Q-Learning to DQN (Value-based RL)
The Q-table is like a "cheat sheet" that helps us find the maximum expected future reward of an action, given the current state.
• This is a good strategy; however, it does not scale to large state spaces.
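A quick back-of-the-envelope calculation shows why: even with the 84x84 grayscale preprocessing used later in these slides, the number of distinct single frames alone is astronomically larger than any table could hold.

import math

# 84 x 84 pixels, 256 intensity levels each: 256**(84*84) possible frames.
log10_states = 84 * 84 * math.log10(256)
print(f"about 10^{log10_states:.0f} distinct frames")   # roughly 10^16993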
Playing Atari with Deep RL (Nature, 2015)
● Played seven Atari 2600 games
● Beat previous ML approaches on six
● Beat a human expert on three
● Aim: create a single neural-network agent that is able to successfully learn to play as many of the games as possible
● Learns strictly from experience, with no pre-training
● Inputs: game screen + score
● No game-specific tuning
What’s Next
Atari
● Rules of the game are unknown
● Learn directly from interactive game play
● Pick actions on the joystick; observe pixels and the score
Preprocessing & Temporal limitation
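The slide's figure is not reproduced here. In the Nature paper the preprocessing works as follows: frames are converted to grayscale and downsampled to 84x84, and, because a single frame cannot convey motion (the temporal limitation), the last four processed frames are stacked together. A sketch using OpenCV and NumPy (the helper names are illustrative):

import cv2
import numpy as np
from collections import deque

def preprocess(frame):
    # Grayscale and downsample a raw Atari frame to 84x84, as in the Nature paper.
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def stack_frames(stack, new_frame):
    # Keep the last four preprocessed frames so the network can perceive motion.
    stack.append(preprocess(new_frame))
    while len(stack) < 4:                    # pad at the start of an episode
        stack.append(stack[-1])
    return np.stack(stack, axis=-1)          # shape (84, 84, 4)

frame_stack = deque(maxlen=4)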
Convolution Layers / Fully Connected
• Frames are processed by three convolution layers.
• These layers let the network exploit spatial relationships in the images.
• In addition, because frames are stacked together, the network can exploit temporal properties across those frames.
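For concreteness, the network described in the Nature paper (three convolution layers followed by fully connected layers) can be sketched in Keras; the framework choice here is ours, not the slides'.

from tensorflow.keras import layers, models

def build_dqn(n_actions):
    # Layer sizes as reported in the Nature DQN paper.
    return models.Sequential([
        layers.Input(shape=(84, 84, 4)),                    # 4 stacked 84x84 frames
        layers.Conv2D(32, 8, strides=4, activation="relu"),
        layers.Conv2D(64, 4, strides=2, activation="relu"),
        layers.Conv2D(64, 3, strides=1, activation="relu"),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_actions),                            # one Q-value per action
    ])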
Experience Replay
Experience replay helps us handle two things:

Avoid forgetting previous experiences: the weights are highly variable because there is strong correlation between successive actions and states, and new experiences tend to overwrite what was learned from older ones.
Solution: create a "replay buffer" that stores experience tuples while interacting with the environment; we then sample a small batch of tuples to feed our neural network.

Reduce correlations between experiences: every action affects the next state, so interaction produces a sequence of experience tuples that can be highly correlated.
Solution: by sampling from the replay buffer at random, we can break this correlation. This prevents action values from oscillating or diverging catastrophically.
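A replay buffer of this kind is only a few lines of Python (a sketch; the capacity and batch size are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest experiences are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive experiences.
        return random.sample(self.buffer, batch_size)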
Clipping Rewards
Each game has a different score scale. For example, in Pong a player gets +1 point for winning a rally and -1 for losing it, whereas in Space Invaders a player gets 10 to 30 points for defeating invaders. This difference makes training unstable.
The reward-clipping technique therefore clips the scores: all positive rewards are set to +1 and all negative rewards to -1.
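In code, reward clipping is a one-liner (a sketch):

import numpy as np

def clip_reward(reward):
    # Map every positive score change to +1, every negative one to -1, and 0 to 0.
    return float(np.sign(reward))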
DQN Algorithm
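The slide's figure is not reproduced here. The core of the algorithm, the minibatch update against a frozen target network introduced in the Nature paper, can be sketched as follows (TensorFlow/Keras is our choice, and the optimizer differs from the paper's RMSProp). The remaining steps are the interaction pieces already sketched above: epsilon-greedy action selection, storing clipped-reward transitions in the replay buffer, and copying the online network's weights into the target network every fixed number of steps.

import numpy as np
import tensorflow as tf

def dqn_train_step(q_net, target_net, batch, optimizer, gamma=0.99):
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    states = states.astype(np.float32) / 255.0          # normalise pixel values
    next_states = next_states.astype(np.float32) / 255.0

    # Targets: y = r for terminal transitions, else y = r + gamma * max_a' Q_target(s', a').
    next_q = target_net.predict(next_states, verbose=0).max(axis=1)
    targets = (rewards + gamma * next_q * (1.0 - dones)).astype(np.float32)

    with tf.GradientTape() as tape:
        q_values = q_net(states)                                  # shape (batch, n_actions)
        mask = tf.one_hot(actions, q_values.shape[-1])
        chosen_q = tf.reduce_sum(q_values * mask, axis=1)         # Q(s, a) for the taken actions
        loss = tf.reduce_mean(tf.square(targets - chosen_q))      # squared TD error
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)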
Performance
Recent graph from Google DeepMind, 2018 (current trend in RL for games)
Naïve DQN vs. replay-buffer-based DQN
Strengths and Weaknesses
● Good at
‣ Quick-moving, complex, short-horizon games
‣ Semi-independent trials within the game
‣ Negative feedback on failure
● Bad at
‣ Long-horizon games that don't converge
‣ Any "walking around" game
‣ Montezuma's Revenge
Worldly knowledge helps humans play these games relatively easily.
Example Code
● DQN with Atari Game
○ Colab Jupyter notebooks
Reference
● Rich Sutton, Reinforcement Learning: An Introduction, 2017
● Deep Reinforcement Learning: An Overview, 2017, https://arxiv.org/pdf/1701.07274.pdf
● UCL course on Reinforcement Learning: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
● CS231n, Reinforcement Learning, Lecture 14, 2017, http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
● Thomas Simonini, Medium post "An Introduction to Reinforcement Learning", https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419
● Arthur Juliani, Medium post "Simple Reinforcement Learning in Tensorflow", https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149