Reinforcement
Learning
Ding Li 2021.3
Reinforcement Learning:
Learning from Interaction
Learn how to map situations to actions,
to maximize rewards.
The learner is not told which actions to take,
but instead must discover which actions yield
the most rewards by trying them.
3
MULTI-ARMED BANDITS
How can an agent get as many rewards as
possible in limited time?
The reward at each arm varies each time, but follows
a normal distribution that is unknown to the agent
Epsilon-Greedy: Balance of Exploration and Exploitation
10% probability (ε): randomly choosing an action (exploration)
90% probability (1−ε): taking the action with the current maximum
estimated reward (greedy, exploitation)
Estimated average reward after an action was chosen n times:
Qn+1 = (R1 + R2 + … + Rn) / n = Qn + (1/n)(Rn − Qn)
A large ε learns fast in the beginning,
but the final reward is not as high due to continued exploration.
Nonstationary problem (the best arm changes over time):
Qn+1 = Qn + α(Rn − Qn), where α is a fixed step size
The newest reward is weighted more than older ones; the weight of an old reward decays exponentially.
Python code
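A minimal sketch of the ε-greedy bandit with the incremental average update above (the arm count and reward distributions are illustrative assumptions, not the slide's code):

import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1):
    k = len(true_means)
    Q = np.zeros(k)          # estimated average reward per arm
    N = np.zeros(k)          # number of times each arm was chosen
    total_reward = 0.0
    for _ in range(steps):
        if np.random.rand() < epsilon:
            a = np.random.randint(k)      # explore: random arm
        else:
            a = int(np.argmax(Q))         # exploit: current best arm
        reward = np.random.normal(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]    # Qn+1 = Qn + (1/n)(Rn - Qn)
        total_reward += reward
    return Q, total_reward

# Example: 10 arms with unknown normally distributed rewards
Q, total = epsilon_greedy_bandit(np.random.normal(0, 1, size=10))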
4
Agent: The learner who interacts with the environment
State: Specific environment setting the agent is sensing
Action: What an agent can do to affect its state and generate reward
Policy: Map states to actions
Episode: agent and environment interact at time steps t = 0, 1, 2, 3, …, T (the last time step)
Trajectory: S0 →(a0) S1 →(a1) S2 →(a2) S3 →(a3) … ST, with rewards r1, r2, r3, …
MDP: the probabilities of St and Rt depend only on the previous state and action.
MDP for Recycling Robot (state is battery level)
Return of step t for an episodic task: Gt = Rt+1 + Rt+2 + … + RT, where T is the terminal time step
Return of step t for a continuing task: Gt = Rt+1 + γRt+2 + γ²Rt+3 + … (discount factor γ)
Policy π(a|s): probability of taking action a at state s
deterministic: a is unique
stochastic: a is not unique
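A small illustration of the return definition: the discounted return can be computed backwards from the end of an episode (the reward list below is made up):

def discounted_returns(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma * G_{t+1}, computed from the last step backwards
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(discounted_returns([0, 0, 1]))  # returns for t = 0, 1, 2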
5
State-value function: expected return when starting in s and following
policy 𝜋
Action-value function: expected return when starting in s and taking action a
Gridworld: The cells of the grid correspond to the states of the environment. At each cell, four actions are possible:
north, south, east, and west. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward
of −1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. γ = 0.9.
Bellman equation
Bellman optimality equation
Optimal policy π∗: achieves the highest value possible in every state
Optimal state-value function: v∗(s) = vπ∗(s)
Optimal action-value function: q∗(s, a) = qπ∗(s, a)
6
Iterative policy evaluation (bootstrapping):
random initialization, then continue the iteration until
the values converge
Policy Improvement (greedy policy):
Policy Iteration
Value Iteration (combine policy evaluation and policy
improvement)
Generalized Policy Iteration (GPI)
The general idea of letting
policy-evaluation and policy
improvement processes
interact, independent of the
granularity and other details
of the two processes
Asynchronous Dynamic Programming
A sweep of all states can be prohibitively expensive.
Asynchronous DP can update values of some states during one
sweep, and can focus on relevant states
Example of Policy
Iteration
Python code
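A minimal sketch of value iteration on a small deterministic gridworld (the 4x4 layout, the -1 step reward, and the corner terminal states are illustrative assumptions, not the slide's exact example):

import numpy as np

def value_iteration(n=4, gamma=1.0, theta=1e-6):
    # n x n gridworld: reward -1 per move, terminal states at two opposite corners
    V = np.zeros((n, n))
    terminals = {(0, 0), (n - 1, n - 1)}
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # north, south, west, east
    while True:
        delta = 0.0
        for i in range(n):
            for j in range(n):
                if (i, j) in terminals:
                    continue
                q_values = []
                for di, dj in moves:
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < n and 0 <= nj < n):
                        ni, nj = i, j               # moving off the grid leaves the state unchanged
                    q_values.append(-1 + gamma * V[ni, nj])
                best = max(q_values)                # Bellman optimality backup
                delta = max(delta, abs(best - V[i, j]))
                V[i, j] = best
        if delta < theta:
            return V

print(value_iteration())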
7
Estimate state values by averaging returns from experience
Estimate action values from experience
Exploring starts: every state-action pair has some probability of being selected as the start
Monte Carlo control
E: Monte Carlo estimation for q given current policy
I: Improve policy by taking greedy action
Blackjack
State: dealer’s card, player sum, usable ace? Action: Stick or Hit
ε-soft policy: exploration at each step, so exploring starts are not needed
Target policy: the policy being learned, focused on exploitation (π)
Behavior policy: the policy generating behavior, focused on exploration (b)
Off-policy learning: learning from behavior "off" the target policy (π ≠ b)
On-policy learning: learning from behavior generated by the target policy itself (π = b)
Importance sampling ratio: relative probability of trajectories under π and b from t to T-1
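A minimal sketch of first-visit Monte Carlo prediction, averaging returns from sampled episodes (sample_episode is a hypothetical stand-in for playing one episode with the policy):

from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes, gamma=1.0):
    """Estimate state values by averaging first-visit returns from experience."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()              # list of (state, reward) pairs
        # Compute the return following each time step, backwards
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        # Average the return of the first visit to each state
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:
                continue
            seen.add(s)
            returns_sum[s] += returns[t]
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]
    return V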
8
Update the state value from the next state's value, waiting only one step
TD error:
Example: Random Walk
TD converges faster to a lower error than Monte Carlo
Python code
Sarsa: On-policy TD Control (ε-greedy)
Q-learning: Off-policy TD Control
Expected Sarsa
More stable than Sarsa since it averages over the action values of the next state,
while Sarsa uses only the action actually taken in the current run, which adds randomness.
Off-policy: it bootstraps from the best action at the next state, not the action actually taken.
An on-policy algorithm can only learn from behavior generated by the current
policy, so data must be discarded when the policy is updated. An off-policy
algorithm can reuse any data, which makes it more sample efficient.
State-Action-Reward-State-Action
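A minimal sketch of tabular Q-learning on a discrete gym-style environment such as FrozenLake (hyperparameters are illustrative; Sarsa would instead bootstrap from the action it actually selects at the next state):

import numpy as np

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # ε-greedy behavior policy
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Off-policy target: bootstrap from the best action at the next state
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q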
9
Python code
Learn from previous experience
Improve learning efficiency
10
Approximate value of a state s given weight vector w:
Typically, the number of weights is much less than the number of states.
Often μ(s) is chosen to be the fraction of time spent in s.
Mean Squared Value Error
Stochastic gradient-descent (SGD): update w over time
Monte Carlo: use the return Gt to estimate 𝑣𝜋(s)
TD: use the TD target Rt+1 + γ v̂(St+1, w) to estimate 𝑣𝜋(s)
Linear Methods
Corresponding to every state s, there is a real-valued
feature vector x(s) representing state s
At the TD fixed point, the value error is within a bounded factor of the minimum possible value error
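A minimal sketch of semi-gradient TD(0) with a linear value function v̂(s, w) = wᵀx(s) (the feature function x, its length, and the environment interface are assumptions):

import numpy as np

def semi_gradient_td0(env, x, num_features, num_episodes, alpha=0.01, gamma=0.99):
    # x(s) returns a numpy feature vector of length num_features
    w = np.zeros(num_features)                  # far fewer weights than states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()  # evaluate a fixed (here random) policy
            next_state, reward, done, _ = env.step(action)
            v = w @ x(state)
            v_next = 0.0 if done else w @ x(next_state)
            # Semi-gradient update: the TD target is treated as a constant
            w += alpha * (reward + gamma * v_next - v) * x(state)
            state = next_state
    return w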
11
Generalization: each tile covers significant state space
Discrimination: multiple tilings overlap each other
Episodic semi-gradient Sarsa: update w to learn q̂(s, a, w)
Consider the task of driving an underpowered car up
a steep mountain road. The difficulty is that gravity is
stronger than the car’s engine, and even at full
throttle the car cannot accelerate up the steep slope.
The only solution is to first move away from the goal
and up the opposite slope on the left. Then, by
applying full throttle the car can build up enough
inertia to carry it up the steep slope even though it is
slowing down the whole way.
Python code
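A simplified sketch of tile coding for a single state variable (the offsets and clamping are simplifications for illustration; real implementations such as Sutton's tiling software differ):

import numpy as np

def tile_features(x, low, high, num_tilings=8, num_tiles=8):
    """One-hot tile features: one active tile per tiling, tilings slightly offset."""
    features = np.zeros(num_tilings * num_tiles)
    tile_width = (high - low) / num_tiles
    for t in range(num_tilings):
        offset = t * tile_width / num_tilings       # each tiling is shifted a little
        idx = int((x - low + offset) / tile_width)
        idx = min(max(idx, 0), num_tiles - 1)       # clamp to a valid tile
        features[t * num_tiles + idx] = 1.0
    return features

# Example: position of the mountain car in [-1.2, 0.5]
phi = tile_features(-0.5, low=-1.2, high=0.5)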
12
Average reward: improves stability for continuing tasks
Differential return: the difference between rewards and the average
reward
13
Neural Networks As Function
Approximation:
Raw states → transformed states
Backpropagation is used to train the weights of the neural
networks.
Challenges:
• The input and output are not given in advance, but are obtained through
interaction with the environment.
• The representations of the current state and future states are linked via the neural
network.
• The correct input is not available.
• Feedback is often sparse.
• Data generation and the training process are coupled.
• The function may have sharp discontinuities.
Experience Replay:
An experience replay memory stores the k most recent experiences an
agent has gathered. If memory is full, the oldest experience is discarded to
make space for the latest one. Each time an agent trains, one or more
batches of data are sampled uniformly at random from the experience replay
memory. Each of these batches is used in turn to update the parameters of
the Q-function network.
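A minimal sketch of such a replay memory using a fixed-size deque (the capacity, field layout, and method names are illustrative):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, max_size=10000):
        # The oldest experiences are discarded automatically when the deque is full
        self.buffer = deque(maxlen=max_size)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling, as described above
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones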
Create features automatically
Initialize learning rate α
Initialize τ
Initialize number of batches per training step B, number of updates per batch U, and batch size N
Initialize experience replay memory with max size K
Randomly initialize the network parameters θ
for m = 1 … MAX_STEPS do
    Gather and store h experiences (si, ai, ri, s′i) using the current policy
    for b = 1 … B do
        Sample a batch of experiences from the experience replay memory
        for u = 1 … U do
            for i = 1 … N do
                # Calculate target Q-values for each example
                yi = ri + δs′i · γ · max a′i Qπθ(s′i, a′i), where δs′i = 0 if s′i is terminal, 1 otherwise
            end for
            # Calculate the loss, for example using MSE
            L(θ) = (1/N) Σi (yi − Qπθ(si, ai))²
            # Update the network’s parameters
            θ = θ − α ∇θ L(θ)
        end for
    end for
    Decay τ
end for
DQN Algorithm
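Translated into a concrete PyTorch sketch of the inner batch update (q_net, optimizer, and the prebuilt batch tensors are assumptions, not the slide's code):

import torch
import torch.nn as nn

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    # states/next_states: float tensors, actions: LongTensor,
    # rewards/dones: float tensors (dones is 1.0 at terminal states)
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target: r + gamma * max_a' Q(s', a'), zeroed at terminal states
        max_next_q = q_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()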
14
Target Networks
When the network parameters θ are updated to minimize
the difference between the target value and the calculated
value, the target value also changes.
The target network is a second network with parameters φ, a
lagged copy of θ. It reduces the change in the target value between
training steps.
Periodically, φ is updated to the current values of θ.
Double DQN
The maximum over estimated action values, maxa Q(s, a), will be positively biased due to errors in each estimate.
DQN overestimates Qπ(s, a) for the (s, a) pairs that have been visited often.
The incorrect relative Q-values will be propagated backwards in time to
earlier (s, a) pairs and add error to those estimates as well. It is
therefore beneficial to reduce the overestimation of Q-values.
The training network θ is used to select the action.
The target network φ is used to evaluate that action.
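A minimal sketch of that target computation, reusing the hypothetical q_net / target_net names from the DQN sketch above (not the slide's code):

import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the training network θ selects, the target network φ evaluates."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # select with θ
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluate with φ
        return rewards + gamma * (1.0 - dones) * next_q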
Prioritized Experience Replay (PER)
Some experiences in the replay memory are more informative than others
ωi is the TD error for experience i, and ε is a small positive number
Priority for experience i: based on |ωi| + ε, so every experience has a nonzero chance of being sampled
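A small sketch of proportional prioritization (the exponent and ε values are illustrative; full PER also applies importance-sampling corrections to the updates):

import numpy as np

def per_sampling_probs(td_errors, eps=1e-6, eta=0.6):
    # Priority of experience i from its TD error; eps keeps every priority positive
    priorities = (np.abs(td_errors) + eps) ** eta
    return priorities / priorities.sum()        # sampling probabilities

probs = per_sampling_probs(np.array([0.5, 0.1, 2.0]))
indices = np.random.choice(len(probs), size=2, p=probs)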
Experiment with Atari Game Pong
Double DQN + PER performs the best.
This is closely followed by DQN + PER, then Double DQN, and DQN.
15
θ: the policy's parameter vector, which determines the action from the state
Parameter for action value
Parameter for policy
Soft-max distribution
Policy parameterization gives finer control over actions than ε-greedy:
• Can autonomously decrease exploration over time.
• Fits stochastic applications well (when an action distribution is needed).
• Sometimes the policy is less complicated than the value function.
The Policy Gradient Theorem for performance measure J
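A minimal sketch of REINFORCE with a linear soft-max policy (the feature function and episode format are hypothetical, not the slide's code):

import numpy as np

def softmax_policy(theta, feats):
    """π(a|s): soft-max over linear action preferences h(s, a) = θᵀ x(s, a)."""
    prefs = np.array([theta @ x for x in feats])        # one feature vector per action
    prefs -= prefs.max()                                # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode, feature_fn, alpha=0.01, gamma=0.99):
    """One REINFORCE update: θ ← θ + α γ^t G_t ∇ ln π(A_t|S_t, θ)."""
    # Returns G_t, computed backwards over the (state, action, reward) episode
    G = 0.0
    returns = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        _, _, r = episode[t]
        G = r + gamma * G
        returns[t] = G
    for t, (s, a, _) in enumerate(episode):
        feats = feature_fn(s)                           # list of x(s, a), one per action
        probs = softmax_policy(theta, feats)
        # ∇ ln π(a|s) for a linear soft-max policy: x(s,a) - Σ_b π(b|s) x(s,b)
        grad_ln_pi = feats[a] - sum(p * x for p, x in zip(probs, feats))
        theta = theta + alpha * (gamma ** t) * returns[t] * grad_ln_pi
    return theta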
16
Apply TD to REINFORCE: Pendulum Swing-Up
Tile-Coding Features
SoftMax Policy:
Early learning vs. final learning
Python code
17
Foundations of Deep Reinforcement Learning
18
Silver 2016
19
Silver 2016
At each time step t of each simulation, an action aₜ is
selected from state sₜ.
The bonus is proportional to the prior probability but
decays with repeated visits, to encourage exploration.
The leaf node is evaluated by the value network vθ(sL) and the outcome zL of a
random rollout played out until terminal step T using the fast rollout policy pπ
20
Silver 2017
• It is trained solely by self-play reinforcement learning, starting from
random play, without any supervision or use of human data.
• It uses only the black and white stones from the board as input features.
• It uses a single neural network, rather than separate policy and value
networks.
• It uses a simpler tree search that relies upon this single neural network
to evaluate positions and sample moves, without performing any Monte
Carlo rollouts.
• A new reinforcement learning algorithm was introduced that
incorporates lookahead search inside the training loop, resulting in rapid
improvement and precise and stable learning.
21
Silver 2017
The neural network (p, v)=fθ (s) is adjusted to minimize the error between the
predicted value v and the self-play winner z, and to maximize the similarity of
the neural network move probabilities p to the search probabilities π.
REINFORCEMENT
LEARNING
FRAMEWORKS
22
23
Agents GitHub
TF-Agents makes implementing, deploying, and testing new Bandits and RL algorithms easier. It
provides well-tested, modular components that can be modified and extended. It enables
fast code iteration, with good test integration and benchmarking.
Algorithms
• DQN: Human level control through deep reinforcement learning Mnih et al., 2015
• DDQN: Deep Reinforcement Learning with Double Q-learning Hasselt et al., 2015
• DDPG: Continuous control with deep reinforcement learning Lillicrap et al., 2015
• TD3: Addressing Function Approximation Error in Actor-Critic Methods Fujimoto et al., 2018
• REINFORCE: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement
Learning Williams, 1992
• PPO: Proximal Policy Optimization Algorithms Schulman et al., 2017
• SAC: Soft Actor Critic Haarnoja et al., 2018
Tutorials
• Introduction to RL
• DQN
24
RAY: Distributed Computing for ML
Researchers at UC Berkeley created the framework
to speed up ML training across CPU cores or cluster nodes.
API in Python, core in C++.
High-level libraries use Ray internally for distributed computation.
Ray GitHub
RLlib: Scalable Reinforcement Learning
import ray
import ray.rllib.agents.ppo as ppo

ray.init()
SELECT_ENV = "CartPole-v1"
config = ppo.DEFAULT_CONFIG.copy()
config['num_workers'] = 8
config['model']['fcnet_hiddens'] = [40, 20]
trainer = ppo.PPOTrainer(config, SELECT_ENV)
for n in range(20):
    result = trainer.train()
    print(result['episode_reward_mean'])
Ray Tune: Hyperparameter Tuning
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"episode_reward_mean": 400},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 6,
        "model": {
            'fcnet_hiddens': [
                tune.grid_search([20, 40, 60, 80]),
                tune.grid_search([20, 40, 60, 80])]},},)
25
Policy: calculate action from state
Colab RLlib
Environment
import gym

env = gym.make('CartPole-v0')

# Choose an action (either 0 or 1).
def sample_policy(state):
    return 0 if state[0] < 0 else 1

def rollout_policy(env, policy):
    state = env.reset()
    done = False
    cumulative_reward = 0
    while not done:
        action = policy(state)
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward
    return cumulative_reward

reward = rollout_policy(env, sample_policy)
import gym, ray
from ray.rllib.agents import ppo

class MyEnv(gym.Env):
    def __init__(self, env_config):
        self.action_space = <gym.Space>
        self.observation_space = <gym.Space>
    def reset(self):
        return <obs>
    def step(self, action):
        return <obs>, <reward: float>, <done: bool>, <info: dict>

ray.init()
trainer = ppo.PPOTrainer(env=MyEnv, config={
    "env_config": {},  # config to pass to env class
})
while True:
    print(trainer.train())
Algorithms
26
SLM Lab GitHub
SLM Lab is a software framework for reproducible
reinforcement learning (RL) research. It enables easy
development of RL algorithms using modular components
and file-based configuration. It also enables flexible
experimentation completed with hyperparameter search,
result analysis and benchmark results.
SLM Lab is also the companion library of the book
Foundations of Deep Reinforcement Learning.
27
A simple REINFORCE CartPole spec file
# slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json
{
  "reinforce_cartpole": {
    "agent": [{
      "name": "Reinforce",
      "algorithm": {
        "name": "Reinforce",
        "action_pdtype": "default",
        "action_policy": "default",
        "center_return": true,
        "explore_var_spec": null,
        "gamma": 0.99,
        "entropy_coef_spec": {
          "name": "linear_decay",
          "start_val": 0.01,
          "end_val": 0.001,
          "start_step": 0,
          "end_step": 20000,
        },
        "training_frequency": 1
      },
      "memory": {
        "name": "OnPolicyReplay"
      },
      "net": {
        "type": "MLPNet",
        "hid_layers": [64],
        "hid_layers_activation": "selu",
        "clip_grad_val": null,
        "loss_spec": {
          "name": "MSELoss"
        },
        "optim_spec": {
          "name": "Adam",
          "lr": 0.002
        },
        "lr_scheduler_spec": null
      }
    }],
    "env": [{
      "name": "CartPole-v0",
      "max_t": null,
      "max_frame": 100000,
    }],
    "body": {
      "product": "outer",
      "num": 1
    },
    "meta": {
      "distributed": false,
      "eval_frequency": 2000,
      "max_session": 4,
      "max_trial": 1,
    },
    ...
  }
}
REINFORCE spec file with search spec for different gamma values
# slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json
{
  "reinforce_cartpole": {
    ...
    "meta": {
      "distributed": false,
      "eval_frequency": 2000,
      "max_session": 4,
      "max_trial": 1,
    },
    "search": {
      "agent": [{
        "algorithm": {
          "gamma__grid_search": [0.1, 0.5, 0.7, 0.8, 0.90, 0.99, 0.999]
        }
      }]
    }
  }
}
γ values above 0.90 perform better,
with γ = 0.999 from trial 6 giving the
best result. When γ is too low, the
algorithm fails to learn a policy that
solves the problem, and the learning
curve stays flat.
28
• Coursera
Reinforcement Learning Specialization
• Books
Reinforcement Learning, second edition: An Introduction
Foundations of Deep Reinforcement Learning
Hands-On Reinforcement Learning for Games
What Is Ray?