Introduction to
Reinforcement Learning
Marsan Ma
2019.10.16
● The controller interacts with the environment.
● Evaluate its performance and give a reward so the controller learns a better strategy.
Idea
Frameworks
● Value based
○ Q-Learning
○ Deep Q-Learning family
○ Sarsa (state-action-reward-state-action ...)
● Policy based
○ Policy gradient
● Actor-Critic
○ DDPG (deep deterministic policy gradient)
Reinforcement Learning - I
Q Learning
● Q-Learning
○ The “model” is actually a lookup table f(State, Action) = Reward, where each cell stores the learned value of taking that action in that state.
○ Example: a cursor V starts at the beginning of the line and tries to move to the goal O at the end of the line:
V _ _ _ _ _ O
○ Target: we want an f(State, Action) = Reward table that guides the cursor to move toward the right
(rows are positions, columns are actions, cells are reward values); a minimal sketch follows below.
Q-Learning : task
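A minimal sketch of what such a Q-table could look like in code, assuming the line pictured above with two actions (left / right); the zero initialization is only the starting point, not the trained result:

import numpy as np

N_STATES = 6                 # non-terminal cursor positions; the goal O is the terminal state (count assumed from the picture)
ACTIONS = ["left", "right"]  # the cursor can move one cell left or right

# The Q-table "model": rows are positions (states), columns are actions,
# and each cell holds the learned value of taking that action in that state.
q_table = np.zeros((N_STATES, len(ACTIONS)))

# After training, the "right" column should hold larger values than "left",
# which is exactly what guides the cursor toward the goal O.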
● Q-Learning
○ Initialize a Q-table, which is what we are going to train.
○ Choose the greedy (best-known) action with probability EPSILON (exploitation);
choose a random action with probability 1-EPSILON (exploration), as in the sketch below.
Q-Learning : how Q-table work
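A sketch of that epsilon-greedy choice (the EPSILON value and the all-zero check are assumptions; the demo notebook may differ in detail):

import numpy as np

EPSILON = 0.9   # probability of exploiting the current Q-table

def choose_action(state, q_table):
    # Exploit the best-known action with probability EPSILON,
    # otherwise explore with a random action.
    state_values = q_table[state, :]
    if np.random.uniform() > EPSILON or np.all(state_values == 0):
        return np.random.randint(len(state_values))   # exploration
    return int(np.argmax(state_values))               # exploitation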
● Q-Learning
○ Get the interaction result (newState, Reward) back from the environment; a toy example follows below.
Q-Learning : environment react to action
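A hypothetical step function for the toy cursor task, just to make the (newState, Reward) interface concrete; the actual environment in the notebook may be defined differently:

def step(state, action):
    # action 1 = move right, action 0 = move left.
    # Only reaching the goal O (the terminal state) gives a reward.
    if action == 1:
        if state == N_STATES - 1:          # next move reaches the goal
            return "terminal", 1.0
        return state + 1, 0.0
    return max(0, state - 1), 0.0          # move left, but never past the start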
● Q-Learning
○ Train by replaying the scenario MAX_EPISODES times (loop sketched below).
■ Update the Q-table during training, since that table is your final “model”.
Q-Learning : how model trained
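Putting the pieces together, a sketch of the replay loop; update_q_table is the update rule spelled out on the next slide:

MAX_EPISODES = 20

for episode in range(MAX_EPISODES):
    state = 0                                    # the cursor starts at V
    while state != "terminal":
        action = choose_action(state, q_table)
        new_state, reward = step(state, action)  # environment reacts
        update_q_table(q_table, state, action, reward, new_state)
        state = new_state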
● Q-table updating
○ If next_state is terminal, take the whole reward as the q_target.
○ If next_state is not terminal, take reward + the discounted (decayed) estimate of future value.
○ Update the Q-table value toward that target with learning_rate = alpha, as written out below.
Q-Learning : how Q-table updated
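The same rule written out as code, i.e. Q(s, a) <- Q(s, a) + alpha * (q_target - Q(s, a)); the ALPHA and GAMMA values are placeholders:

ALPHA = 0.1    # learning_rate
GAMMA = 0.9    # discount factor for the estimated future value

def update_q_table(q_table, state, action, reward, new_state):
    q_predict = q_table[state, action]
    if new_state == "terminal":
        q_target = reward                                        # claim the whole reward
    else:
        q_target = reward + GAMMA * q_table[new_state, :].max()  # reward + decayed future value
    q_table[state, action] += ALPHA * (q_target - q_predict)     # move toward the target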
● Q-Learning
○ DEMO - Google Colab notebook
Q-Learning
Reinforcement Learning - II
Policy Gradient
● Main concept
○ Ignore the “value of an action”; keep only the actions and their probabilities.
(With value-based methods like Q-learning, you need a huge table to store values for all actions.)
○ No “epsilon” is needed for explore/exploit, since actions are natively sampled by probability.
Policy Gradient
● Controller initialization
○ A very simple network, since this toy task has no fancy data characteristics (sketch below).
Policy Gradient : initialize
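A minimal PyTorch-style sketch of such a controller; the layer sizes are assumptions, not the notebook's exact architecture:

import torch.nn as nn

class PolicyNet(nn.Module):
    # A very small policy network: state in, action probabilities out.
    def __init__(self, n_states, n_actions, n_hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_actions),
            nn.Softmax(dim=-1),   # probabilities, so actions can be sampled directly
        )

    def forward(self, state):
        return self.net(state)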
● Training Process
○ Learning happens only after the episode has ended, so data efficiency is low.
Policy Gradient : training
● Training Process
○ Rewards are discounted (decayed) over time, then scaled.
○ The loss is computed from the episode's policy history and rewards, then backpropagated to update the network.
○ Loss = how “surprised” you are by the reward = -reward * log(probability of the chosen action), as sketched below.
Policy Gradient : update network
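A sketch of the reward processing and loss under those assumptions (GAMMA and the normalization epsilon are placeholders):

import torch

GAMMA = 0.99

def discount_and_normalize(rewards):
    # Discount rewards over time, then scale to zero mean / unit variance.
    discounted, running = [], 0.0
    for r in reversed(rewards):
        running = r + GAMMA * running
        discounted.insert(0, running)
    discounted = torch.tensor(discounted)
    return (discounted - discounted.mean()) / (discounted.std() + 1e-8)

def policy_loss(log_probs, rewards):
    # REINFORCE-style loss: -reward * log-probability of the chosen action,
    # summed over the episode.
    returns = discount_and_normalize(rewards)
    return -(torch.stack(log_probs) * returns).sum()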
● Policy Gradient
○ DEMO - Google Colab notebook
Policy Gradient
Reinforcement Learning - III
Deep Q Learning
● Main concept
○ In the real world, the search space of states / actions might be too large to store.
○ Rather than storing all combinations in a Q-table, how about summarizing them into a model?
=> A neural network is good at representing arbitrarily complex functions.
● Other reasons why DQN is powerful
○ Experience replay
■ Learn from past experience as many times as you wish (buffer sketched below).
○ Fixed Q-target
■ An improvement over plain Q-learning that solves the “chasing its own tail” problem caused by
computing both the evaluation and the target with the same network.
■ Use 2 duplicated networks: one to predict the next action, the other to evaluate (compute the target).
This avoids the “chasing its own tail” condition and makes training stable.
Deep Q-Learning
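A sketch of an experience replay buffer (the capacity is an arbitrary placeholder):

import random
from collections import deque

class ReplayBuffer:
    # Store past transitions so they can be re-learned from as many times as you wish.
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)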
● Initialize
○ Note that there are 2 identical networks: eval_net and target_net (sketch below).
Deep Q-Learning
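A sketch of that initialization, assuming a small fully connected Q-network; the state/action sizes are placeholders:

import copy
import torch.nn as nn

def build_q_net(n_states, n_actions, n_hidden=64):
    # Plain feed-forward Q-network: state in, one Q-value per action out.
    return nn.Sequential(
        nn.Linear(n_states, n_hidden),
        nn.ReLU(),
        nn.Linear(n_hidden, n_actions),
    )

eval_net = build_q_net(n_states=4, n_actions=2)   # trained at every learning step
target_net = copy.deepcopy(eval_net)              # frozen copy used only to compute Q-targets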
● Choose action: still about exploration/exploitation
Deep Q-Learning
● Training process
○ Learn every N steps, which is more sample-efficient than the policy network that learns only once per episode.
○ Update the state at the end of the episode.
Deep Q-Learning
● Learning process :
○ eval_net pursues target_net over a batch of replayed transitions.
○ Then copy eval_net into target_net (sketch below).
Deep Q-Learning
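A sketch of that learning step, assuming the replayed batch has already been stacked into tensors; the hyperparameters are placeholders:

import torch
import torch.nn.functional as F

GAMMA = 0.9
TARGET_REPLACE_EVERY = 100     # how often eval_net is copied into target_net
learn_step = 0

def learn(eval_net, target_net, optimizer, batch):
    global learn_step
    states, actions, rewards, next_states, dones = batch   # actions: int64, dones: 0/1 floats

    # eval_net's estimate for the actions that were actually taken.
    q_eval = eval_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # target_net is never trained directly
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + GAMMA * q_next * (1 - dones)

    loss = F.mse_loss(q_eval, q_target)                     # eval_net pursues target_net
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    learn_step += 1
    if learn_step % TARGET_REPLACE_EVERY == 0:
        target_net.load_state_dict(eval_net.state_dict())   # copy eval_net to target_net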
● Deep Q-Learning network
○ DEMO - Google Colab notebook
Deep Q-Learning
Reinforcement Learning - IV
Deep Deterministic
Policy Gradient
● Main concept
○ Actor-Critic
■ The Actor is policy-gradient based and is good at choosing from a continuous action space.
■ The Critic is value based (like Q-learning), updates every step, and gives good learning efficiency (sketch below).
DDPG
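A minimal sketch of the two roles; the layer sizes and the action scaling are assumptions:

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Policy network: maps a state to a continuous action.
    def __init__(self, n_states, n_actions, action_bound, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_actions), nn.Tanh(),   # squash to [-1, 1]
        )
        self.action_bound = action_bound

    def forward(self, state):
        return self.net(state) * self.action_bound       # scale to the real action range

class Critic(nn.Module):
    # Value network: scores a (state, action) pair, Q-learning style.
    def __init__(self, n_states, n_actions, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states + n_actions, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))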
● DDPG : Actor & Critic (minimum version)
DDPG
● DDPG trainer : coordinates 2 actors + 2 critics (each with an eval and a target version)
DDPG
● DDPG : choosing actions by explore & exploit
DDPG
● DDPG - training
DDPG
● DDPG - learn
DDPG
● DDPG - soft_update / hard_update
DDPG
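A sketch of the two update styles; tau is a placeholder value:

def soft_update(target_net, eval_net, tau=0.01):
    # Nudge the target network a small step (tau) toward the eval network.
    for t_param, e_param in zip(target_net.parameters(), eval_net.parameters()):
        t_param.data.copy_(tau * e_param.data + (1.0 - tau) * t_param.data)

def hard_update(target_net, eval_net):
    # Overwrite the target network with the eval network in one shot.
    target_net.load_state_dict(eval_net.state_dict())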
● DDPG
○ DEMO - Google Colab notebook
DDPG
