Introduction to
Reinforcement Learning
Marsan Ma
2019.10.16
● The controller interacts with the environment.
● Evaluate its performance and give a reward so the controller learns a better strategy.
Idea
Frameworks
● Value based
○ Q-Learning
○ Deep Q-Learning family
○ Sarsa (state-action-reward-state-action ...)
● Policy based
○ Policy gradient
● Actor-Critic
○ DDPG (deep deterministic policy gradient)
Reinforcement Learning - I
Q Learning
● Q-Learning
○ The “model” is actually a lookup table f(State, Action) = Reward, where each cell stores the learned value of taking that action in that state.
○ Example: a cursor V starts at the beginning of the line and tries to move to the goal O at the end of the line:
V _ _ _ _ _ O
○ Target: we want an f(State, Action) = Reward table that guides the cursor to move toward the right
(rows are positions, columns are actions, cells are reward values); a minimal sketch follows below.
Q-Learning : task
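A minimal sketch of what such a Q-table could look like in code, assuming the line pictured above with two actions (left / right); the zero initialization is only the starting point, not the trained result:

import numpy as np

N_STATES = 6                 # non-terminal cursor positions; the goal O is the terminal state (count assumed from the picture)
ACTIONS = ["left", "right"]  # the cursor can move one cell left or right

# The Q-table "model": rows are positions (states), columns are actions,
# and each cell holds the learned value of taking that action in that state.
q_table = np.zeros((N_STATES, len(ACTIONS)))

# After training, the "right" column should hold larger values than "left",
# which is exactly what guides the cursor toward the goal O.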
● Q-Learning
○ Initialize a Q-table, which is what we are going to train.
○ Choose the greedy (best-known) action with probability EPSILON (exploitation);
choose a random action with probability 1-EPSILON (exploration), as in the sketch below.
Q-Learning : how Q-table work
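A sketch of that epsilon-greedy choice (the EPSILON value and the all-zero check are assumptions; the demo notebook may differ in detail):

import numpy as np

EPSILON = 0.9   # probability of exploiting the current Q-table

def choose_action(state, q_table):
    # Exploit the best-known action with probability EPSILON,
    # otherwise explore with a random action.
    state_values = q_table[state, :]
    if np.random.uniform() > EPSILON or np.all(state_values == 0):
        return np.random.randint(len(state_values))   # exploration
    return int(np.argmax(state_values))               # exploitation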
● Q-Learning
○ Get the interaction result (newState, Reward) back from the environment; a toy example follows below.
Q-Learning : environment react to action
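A hypothetical step function for the toy cursor task, just to make the (newState, Reward) interface concrete; the actual environment in the notebook may be defined differently:

def step(state, action):
    # action 1 = move right, action 0 = move left.
    # Only reaching the goal O (the terminal state) gives a reward.
    if action == 1:
        if state == N_STATES - 1:          # next move reaches the goal
            return "terminal", 1.0
        return state + 1, 0.0
    return max(0, state - 1), 0.0          # move left, but never past the start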
● Q-Learning
○ Train by replaying the scenario MAX_EPISODES times (loop sketched below).
■ Update the Q-table during training, since that table is your final “model”.
Q-Learning : how model trained
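Putting the pieces together, a sketch of the replay loop; update_q_table is the update rule spelled out on the next slide:

MAX_EPISODES = 20

for episode in range(MAX_EPISODES):
    state = 0                                    # the cursor starts at V
    while state != "terminal":
        action = choose_action(state, q_table)
        new_state, reward = step(state, action)  # environment reacts
        update_q_table(q_table, state, action, reward, new_state)
        state = new_state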
● Q-table updating
○ If next_state is terminal, take the whole reward as the q_target.
○ If next_state is not terminal, take reward + the discounted (decayed) estimate of future value.
○ Update the Q-table value toward that target with learning_rate = alpha, as written out below.
Q-Learning : how Q-table updated
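The same rule written out as code, i.e. Q(s, a) <- Q(s, a) + alpha * (q_target - Q(s, a)); the ALPHA and GAMMA values are placeholders:

ALPHA = 0.1    # learning_rate
GAMMA = 0.9    # discount factor for the estimated future value

def update_q_table(q_table, state, action, reward, new_state):
    q_predict = q_table[state, action]
    if new_state == "terminal":
        q_target = reward                                        # claim the whole reward
    else:
        q_target = reward + GAMMA * q_table[new_state, :].max()  # reward + decayed future value
    q_table[state, action] += ALPHA * (q_target - q_predict)     # move toward the target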
● Q-Learning
○ DEMO - Google Colab notebook
Q-Learning
Reinforcement Learning - II
Policy Gradient
● Main concept
○ Ignore the “value of an action”; keep only the actions and their probabilities.
(With value-based methods like Q-learning, you need a huge table to store values for all actions.)
○ No “epsilon” is needed for explore/exploit, since actions are natively sampled by probability.
Policy Gradient
● Controller initialization
○ A very simple network, since this toy task has no fancy data characteristics (sketch below).
Policy Gradient : initialize
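A minimal PyTorch-style sketch of such a controller; the layer sizes are assumptions, not the notebook's exact architecture:

import torch.nn as nn

class PolicyNet(nn.Module):
    # A very small policy network: state in, action probabilities out.
    def __init__(self, n_states, n_actions, n_hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_actions),
            nn.Softmax(dim=-1),   # probabilities, so actions can be sampled directly
        )

    def forward(self, state):
        return self.net(state)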
● Training Process
○ Learning happens only after the episode has ended, so data efficiency is low.
Policy Gradient : training
● Training Process
○ Rewards are discounted (decayed) over time, then scaled.
○ The loss is computed from the episode's policy history and rewards, then backpropagated to update the network.
○ Loss = how “surprised” you are by the reward = -reward * log(probability of the chosen action), as sketched below.
Policy Gradient : update network
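A sketch of the reward processing and loss under those assumptions (GAMMA and the normalization epsilon are placeholders):

import torch

GAMMA = 0.99

def discount_and_normalize(rewards):
    # Discount rewards over time, then scale to zero mean / unit variance.
    discounted, running = [], 0.0
    for r in reversed(rewards):
        running = r + GAMMA * running
        discounted.insert(0, running)
    discounted = torch.tensor(discounted)
    return (discounted - discounted.mean()) / (discounted.std() + 1e-8)

def policy_loss(log_probs, rewards):
    # REINFORCE-style loss: -reward * log-probability of the chosen action,
    # summed over the episode.
    returns = discount_and_normalize(rewards)
    return -(torch.stack(log_probs) * returns).sum()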
● Policy Gradient
○ DEMO - Google Colab notebook
Policy Gradient
Reinforcement Learning - III
Deep Q Learning
● Main concept
○ In the real world, the search space of states / actions might be too large to store.
○ Rather than storing all combinations in a Q-table, how about summarizing them into a model?
=> A neural network is good at representing arbitrarily complex functions.
● Other reasons why DQN is powerful
○ Experience replay
■ Learn from past experience as many times as you wish (buffer sketched below).
○ Fixed Q-target
■ An improvement over plain Q-learning that solves the “chasing its own tail” problem caused by
computing both the evaluation and the target with the same network.
■ Use 2 duplicated networks: one to predict the next action, the other to evaluate (compute the target).
This avoids the “chasing its own tail” condition and makes training stable.
Deep Q-Learning
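A sketch of an experience replay buffer (the capacity is an arbitrary placeholder):

import random
from collections import deque

class ReplayBuffer:
    # Store past transitions so they can be re-learned from as many times as you wish.
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)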
● Initialize
○ Note that there are 2 identical networks: eval_net and target_net (sketch below).
Deep Q-Learning
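A sketch of that initialization, assuming a small fully connected Q-network; the state/action sizes are placeholders:

import copy
import torch.nn as nn

def build_q_net(n_states, n_actions, n_hidden=64):
    # Plain feed-forward Q-network: state in, one Q-value per action out.
    return nn.Sequential(
        nn.Linear(n_states, n_hidden),
        nn.ReLU(),
        nn.Linear(n_hidden, n_actions),
    )

eval_net = build_q_net(n_states=4, n_actions=2)   # trained at every learning step
target_net = copy.deepcopy(eval_net)              # frozen copy used only to compute Q-targets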
● Choose action: still about exploration/exploitation
Deep Q-Learning
● Training process
○ Learn every N steps, which is more sample-efficient than the policy network that learns only once per episode.
○ Update the state at the end of the episode.
Deep Q-Learning
● Learning process :
○ eval_net pursues target_net over a batch of replayed transitions.
○ Then copy eval_net into target_net (sketch below).
Deep Q-Learning
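A sketch of that learning step, assuming the replayed batch has already been stacked into tensors; the hyperparameters are placeholders:

import torch
import torch.nn.functional as F

GAMMA = 0.9
TARGET_REPLACE_EVERY = 100     # how often eval_net is copied into target_net
learn_step = 0

def learn(eval_net, target_net, optimizer, batch):
    global learn_step
    states, actions, rewards, next_states, dones = batch   # actions: int64, dones: 0/1 floats

    # eval_net's estimate for the actions that were actually taken.
    q_eval = eval_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # target_net is never trained directly
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + GAMMA * q_next * (1 - dones)

    loss = F.mse_loss(q_eval, q_target)                     # eval_net pursues target_net
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    learn_step += 1
    if learn_step % TARGET_REPLACE_EVERY == 0:
        target_net.load_state_dict(eval_net.state_dict())   # copy eval_net to target_net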
● Deep Q-Learning network
○ DEMO - Google Colab notebook
Deep Q-Learning
Reinforcement Learning - IV
Deep Deterministic
Policy Gradient
● Main concept
○ Actor-Critic
■ The Actor is policy-gradient based and is good at choosing from a continuous action space.
■ The Critic is value based (like Q-learning), updates every step, and gives good learning efficiency (sketch below).
DDPG
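A minimal sketch of the two roles; the layer sizes and the action scaling are assumptions:

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Policy network: maps a state to a continuous action.
    def __init__(self, n_states, n_actions, action_bound, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_actions), nn.Tanh(),   # squash to [-1, 1]
        )
        self.action_bound = action_bound

    def forward(self, state):
        return self.net(state) * self.action_bound       # scale to the real action range

class Critic(nn.Module):
    # Value network: scores a (state, action) pair, Q-learning style.
    def __init__(self, n_states, n_actions, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states + n_actions, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))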
● DDPG : Actor & Critic (minimum version)
DDPG
● DDPG trainer : coordinates 2 actors + 2 critics (each with an eval and a target version)
DDPG
● DDPG : choosing actions by explore & exploit
DDPG
● DDPG - training
DDPG
● DDPG - learn
DDPG
● DDPG - soft_update / hard_update
DDPG
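A sketch of the two update styles; tau is a placeholder value:

def soft_update(target_net, eval_net, tau=0.01):
    # Nudge the target network a small step (tau) toward the eval network.
    for t_param, e_param in zip(target_net.parameters(), eval_net.parameters()):
        t_param.data.copy_(tau * e_param.data + (1.0 - tau) * t_param.data)

def hard_update(target_net, eval_net):
    # Overwrite the target network with the eval network in one shot.
    target_net.load_state_dict(eval_net.state_dict())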
● DDPG
○ DEMO - Google Colab notebook
DDPG
