Reinforcement Learning
Learning Through Interaction
Sutton:
“When an infant plays, waves its arms, or looks about, it has no explicit
teacher, but it does have a direct sensorimotor connection to its
environment. Exercising this connection produces a wealth of
information about cause and effect, about the consequences of
actions, and about what to do in order to achieve goals”
• Reinforcement learning is a computational approach for this type of
learning. It adopts AI perspective to model learning through
interaction.
• A single agent interacts with the system by taking an
action. Upon this action it receives a reward and moves to
the next state. Online learning becomes plausible.
Reinforcement Learning
Reinforcement Objective
• Learning the relation between the current situation (state) and the
action to be taken in order to optimize a “payment”
Predicting the expected future reward given the current state (s):
1. Which actions should we take in order to maximize our gain?
2. Which actions should we take in order to maximize the click rate?
• The action that is taken influences the next state (a “closed loop”)
• The learner has to discover which action to take (in ML terminology
we can write the feature vector such that some features are functions of
others)
RL- Elements
• State (s) - The place where the agent is right now.
Examples:
1. A position on a chess board
2. A potential customer on a sales website
• Action (a) - An action that the agent can take while in a state.
Examples:
1. A knight or pawn captures a bishop
2. The user buys a ticket
• Reward (r) - The reward that is obtained due to the action
Examples:
1. A better or worse position
2. More money or more clicks
Basic Elements (Cont)
• Policy (π) - The “strategy” by which the agent decides which action to take.
Abstractly speaking, the policy is simply a probability function over actions that is defined for
each state
• Episode – A sequence of states and their actions
• Vπ(s) - The value function of a state s when using policy π. Mostly it is the
expected reward (e.g. in chess, the expected final outcome of the game if we
follow a given strategy)
• V(s) - Similar to Vπ(s) without a fixed policy (the expected reward over all
possible trajectories starting from s)
• Q(s,a) - The analog of V(s) on the state-action plane: the value function for state s and action a
Examples
• Tic Tac Toe
GridWorld (0,-1,10,5)
• We wish to find the best slot machine (best = max reward).
Strategy
Play! ... and find the machine with the biggest reward (on average)
• At the beginning we pick each slot machine randomly
• After several steps we gain some knowledge
How do we choose which machine to play?
1. Should we always use the best machine?
2. Should we pick one randomly?
3. Any other mechanism?
Slot Machines n-armed bandit
• The common trade-off
1. Always play the best machine so far - Exploitation
We may miss better machines due to statistical “noise”
2. Choose a machine randomly - Exploration
We don’t take the machine that currently looks optimal
Epsilon Greedy
We exploit with probability (1 - ε) and explore with probability ε
Typically ε = 0.1
Exploration, Exploitation & Epsilon Greedy
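To make the ε-greedy trade-off concrete, here is a minimal sketch of an ε-greedy n-armed bandit; the machine payouts in `true_means`, the Gaussian reward noise and ε = 0.1 are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]      # hypothetical average payout of each slot machine
n_arms = len(true_means)
q_est = np.zeros(n_arms)          # running estimate of each machine's mean reward
counts = np.zeros(n_arms)
epsilon = 0.1                     # explore with probability epsilon

for step in range(10_000):
    if rng.random() < epsilon:                # exploration: pick a random machine
        a = rng.integers(n_arms)
    else:                                     # exploitation: best machine so far
        a = int(np.argmax(q_est))
    reward = rng.normal(true_means[a], 1.0)   # pull the arm (assumed Gaussian payout)
    counts[a] += 1
    q_est[a] += (reward - q_est[a]) / counts[a]   # incremental mean update

print("estimated means:", np.round(q_est, 2))     # should approach true_means
```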
• Some problems (like the n-armed bandit) are “next best action” problems:
1. A single given state
2. A set of options that are associated with this state
3. A reward for each action
• Sometimes we wish to learn journeys
Examples:
1. Teach a robot to go from point A to point B
2. Find the fastest way to drive home
Episodes
• Episode
1. A “time series” of states {S1, S2, S3, ..., SK}
2. For every state Si there is a set of options {O1, O2, ..., Oki}
3. The learning formula (the “gradient”) depends not only on the immediate
reward but on the next state as well
Episode (Cont.)
• The observed sequence:
St, At, Rt+1, St+1, At+1, Rt+2, ..., ST, AT, RT+1   (S - state, A - action, R - reward)
• We optimize our goal function (commonly maximizing the average):
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... + γˡRt+l+1,   0 < γ ≤ 1 - the discount (“aging”) factor
Classical Example
The Pole Balancing
Find the force to apply
in order to keep the pole up.
The reward is 1 for every time step in which
the pole didn’t fall.
Reinforcement Learning – Foundation
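As a quick illustration of the return Gt, here is a small sketch that computes the discounted sum of a pole-balancing-style reward sequence; the rewards and γ = 0.9 are made-up values.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Pole-balancing style rewards: 1 per time step the pole stayed up (5 steps here).
print(discounted_return([1, 1, 1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 + ... = 4.0951
```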
Markov Property
Pr{ St+1 = s’, Rt+1 = r | S0, A0, R1, . . . , St-1, At-1, Rt , St , At }= Pr{ St+1 = s’, Rt+1 = r | St , At }
i.e. : The current state captures the entire history
• Markov processes are fully determined by the transition matrix P
Markov Process (or Markov Chain)
A tuple <S,P> where
S - set of states (mostly finite),
P a state transition probability matrix. Namely: Pss’= P [St+1 = s’ | St = s]
Markov Decision Process -MDP
A Markov Reward Process -MRP (Markov Chain with Values)
A tuple < S,P, R, γ>
S ,P as in Markov process,
R a reward function Rs = E [Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1] (as in Gt )
State Value Function for MRP:
v(s) = E [Gt | St = s]
MDP-cont
Bellman Eq.
• v(s) = E[Gt | St = s] = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s] =
E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s] = E[Rt+1 + γGt+1 | St = s]
We get a recursion rule:
v(s) = E[Rt+1 + γ v(St+1) | St = s]
Similarly we can define a value on the state-action space:
Q(s,a) = E[Gt | St = s, At = a]
MDP - MRP with a finite set of actions A
MDP-cont
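To make the MRP value function concrete, here is a sketch that solves the Bellman equation v = R + γPv exactly for a small, made-up two-state chain; the transition matrix, rewards and γ are illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-state Markov Reward Process.
P = np.array([[0.9, 0.1],      # transition matrix P[s, s']
              [0.2, 0.8]])
R = np.array([1.0, -1.0])      # expected immediate reward R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma*P) v = R
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)   # state values of the MRP
```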
• Recall - policy π is the strategy: it maps states to actions.
π(a|s) = P[At = a | St = s]
We assume that for each time t and state s, π(·|St) is fixed (π is stationary)
Clearly, for an MDP a given policy π modifies the MDP:
R -> Rπ, P -> Pπ
We modify V & Q accordingly:
Vπ(s) = Eπ[Gt | St = s]
Qπ(s,a) = Eπ[Gt | St = s, At = a]
Policy
• For V (or Q) the optimal value function v*, for each state s:
v*(s) = maxπ vπ(s),   π - policy
Solving the MDP ≡ Finding the optimal value function!!!!
Optimal Policy
π ≥ π’ if vπ(s) ≥ v π’(s) ∀s
Theorem
For every MDP there exists an optimal policy
Optimal Value Function
• If we know q*(s,a) we can find the optimal policy by acting greedily: π*(s) = argmaxa q*(s,a)
Optimal Value (Cont)
• Dynamic programming
• Monte Carlo
• TD methods
Objectives
Prediction - Find the value function of a given policy
Control – Find the optimal policy
Solution Methods
• A class of algorithms used in many applications such as graph theory
(shortest path) and bioinformatics. It relies on two essential properties:
1. The problem can be decomposed into sub-problems
2. Sub-solutions can be cached and reused
An RL MDP satisfies both
• We assume a full knowledge of MDP !!!
Prediction
Input: MDP and policy
Output: Optimal Value function vπ
Control
Input: MDP
Output: Optimal Value function v* Optimal policy π *
Dynamic Programming
• Given a policy π and an MDP, we wish to find Vπ(s)
Vπ(s) = Eπ[Rt+1 + γVπ(St+1) | St = s]
• Since the policy and the MDP are known, this is a system of linear equations in the v(si),
but…. extremely tedious!!!! Let’s do something iterative (Sutton & Barto)
Prediction – Optimal Value Function
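A minimal sketch of the iterative evaluation suggested above, applying Vπ(s) = Eπ[Rt+1 + γVπ(St+1) | St = s] as a sweep until convergence; the policy-induced transition matrix and rewards describe a made-up 4-state chain, not an example from the slides.

```python
import numpy as np

# Policy-induced transition matrix P_pi and reward vector R_pi for a made-up chain.
P_pi = np.array([[0.0, 1.0, 0.0, 0.0],
                 [0.5, 0.0, 0.5, 0.0],
                 [0.0, 0.5, 0.0, 0.5],
                 [0.0, 0.0, 0.0, 1.0]])   # last state is terminal (absorbing)
R_pi = np.array([-1.0, -1.0, -1.0, 0.0])
gamma = 1.0

v = np.zeros(4)
for sweep in range(1000):
    # V_pi(s) = E_pi[ R_{t+1} + gamma * v(S_{t+1}) | S_t = s ]
    v_new = R_pi + gamma * P_pi @ v
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(np.round(v, 3))
```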
• Following the previous algorithm, one can use an improvement step (often a greedy one) on the evaluated
value function to improve the policy, which leads to the optimal policy and value function
Policy Improvement (policy iteration)
• Policy iteration requires repeated policy updates, which can be heavy.
• Instead we can study V* directly and obtain the policy from it.
• The idea is the Bellman optimality equation: V*(s) = maxa E[Rt+1 + γV*(St+1) | St = s, At = a]
• Hence we can find V* iteratively (and derive the optimal policy)
Value Iteration
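A sketch of value iteration on a tiny made-up MDP: each sweep applies the Bellman optimality backup and the greedy policy is read off from the resulting values; all transition probabilities and rewards are illustrative assumptions.

```python
import numpy as np

# Made-up MDP: 3 states, 2 actions. P[a, s, s'] and R[a, s] are illustrative.
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.8, 0.2],
               [0.0, 0.0, 1.0]],     # action 0
              [[0.2, 0.8, 0.0],
               [0.0, 0.2, 0.8],
               [0.0, 0.0, 1.0]]])    # action 1
R = np.array([[0.0, 0.0, 0.0],
              [-0.1, -0.1, 1.0]])    # R[a, s]: action 1 costs a bit, state 2 pays off
gamma = 0.9

V = np.zeros(3)
for sweep in range(500):
    Q = R + gamma * (P @ V)          # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
    V_new = Q.max(axis=0)            # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)            # greedy policy derived from V*
print(np.round(V, 3), policy)
```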
• The formula supports online updates
• Bootstrapping
• Mostly we don’t have the MDP model
DP -Remarks
• A model-free method (we don’t need the MDP)
1. It learns from generated episodes.
2. It must complete an episode to obtain the required average.
3. It is unbiased
• For a policy π
S0, A0, R1, ..., St ~ π
We use the empirical mean return rather than the expected return.
V(St) ← V(St) + (1 / N(St)) [Gt − V(St)],   N(St) - number of visits to St
For non-stationary cases we update differently:
V(St) ← V(St) + α [Gt − V(St)]
In MC one must terminate the episode to get the value (we calculate the mean
explicitly). Hence in grid problems it may work badly.
Monte Carlo Methods
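A sketch of (every-visit) Monte Carlo prediction on a made-up 1-D random walk with a random policy: episodes must terminate before the return Gt can be averaged into V, exactly as noted above. The environment and episode count are assumptions for illustration.

```python
import random
from collections import defaultdict

N_STATES = 5            # states 0..4; 0 and 4 are terminal (made-up random walk)
GAMMA = 1.0
rng = random.Random(0)

def run_episode():
    """Random-policy episode: step left/right, reward 1 only on reaching state 4."""
    s = 2
    traj = []
    while s not in (0, N_STATES - 1):
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        traj.append((s, r))
        s = s_next
    return traj

V = defaultdict(float)
N = defaultdict(int)
for _ in range(20_000):
    traj = run_episode()
    G = 0.0
    for s, r in reversed(traj):          # compute returns backwards through the episode
        G = r + GAMMA * G
        N[s] += 1
        V[s] += (G - V[s]) / N[s]        # incremental mean: V <- V + (G - V)/N
print({s: round(V[s], 2) for s in sorted(V)})   # ~ probability of finishing on the right
```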
• Learn the optimal policy (using Q function):
Monte Carlo Control
Temporal Difference - TD
• Motivation - Combining DP & MC
As in MC - learning from experience, no explicit MDP
As in DP - bootstrapping, no need to complete the episodes
Prediction
Recall that for MC we have V(St) ← V(St) + α [Gt − V(St)],
where Gt is known only at the end of the episode.
TD Methods (Cont.)
• The TD method only needs to wait until the next step (TD(0)):
V(St) ← V(St) + α [Rt+1 + γ V(St+1) − V(St)]
We can see that this leads to different targets:
MC - Gt
TD - Rt+1 + γ V(St+1)
• Hence it is a bootstrapping method
The estimation of V given a policy is straightforward since the policy
chooses St+1.
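A matching TD(0) sketch on the same kind of random walk: V is updated toward the bootstrapped target Rt+1 + γV(St+1) at every step, without waiting for the episode to end. The task, step size α and episode count are illustrative.

```python
import random

N_STATES = 5                 # 0 and 4 terminal, as in the MC sketch above
GAMMA, ALPHA = 1.0, 0.05
rng = random.Random(0)
V = [0.0] * N_STATES         # V of the terminal states stays 0

for episode in range(20_000):
    s = 2
    while s not in (0, N_STATES - 1):
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # TD(0): V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V[1:-1]])   # approximately [0.25, 0.5, 0.75]
```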
Driving Home Problem
TD vs. MC - Summary
MC
• High variance, unbiased
• Good convergence
• Easy to understand
• Low sensitivity to initial conditions
TD
• Efficiency
• Convergence to Vπ
• More sensitive to initial conditions
SARSA
• An on-policy method for Q-learning (update after every step):
Q(St, At) ← Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) − Q(St, At)]
The next step is to use SARSA to develop a control algorithm: we
learn the Q function on-policy and update the policy toward
greediness (sketched below).
On Policy Control Algorithm
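A hedged sketch of an on-policy (SARSA) control loop on a made-up 1-D chain: actions are chosen ε-greedily from Q, and Q is updated with the on-policy target Rt+1 + γQ(St+1, At+1). The environment, step cost and hyper-parameters are assumptions.

```python
import random

N_STATES, ACTIONS = 6, (-1, +1)          # made-up 1-D chain; reach state 5 for reward 1
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1
rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def eps_greedy(s):
    if rng.random() < EPS:
        return rng.randrange(2)
    return max(range(2), key=lambda a: Q[s][a])

def step(s, a):
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else -0.01     # small step cost, big terminal reward
    return s2, r, s2 == N_STATES - 1

for episode in range(2_000):
    s, a = 0, eps_greedy(0)
    done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = eps_greedy(s2)                      # the action actually taken next (on-policy)
        target = r + (0.0 if done else GAMMA * Q[s2][a2])
        Q[s][a] += ALPHA * (target - Q[s][a])    # SARSA update
        s, a = s2, a2

print([max(range(2), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)])  # greedy policy
```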
Example Windy Grid-World
Q-learning - Off Policy
• Rather than bootstrapping from the action the policy actually takes next, we simply use
the best action for the next state:
Q(St, At) ← Q(St, At) + α [Rt+1 + γ maxa Q(St+1, a) − Q(St, At)]
The control algorithm is straightforward
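For comparison, a sketch of the off-policy Q-learning update on the same kind of made-up chain: behaviour is still ε-greedy, but the bootstrap uses maxa Q(St+1, a) rather than the action the policy actually takes next.

```python
import random

N_STATES, GAMMA, ALPHA, EPS = 6, 0.95, 0.1, 0.1
rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # actions: 0 = left, 1 = right

for episode in range(2_000):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy behaviour policy
        a = rng.randrange(2) if rng.random() < EPS else max(range(2), key=lambda b: Q[s][b])
        s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else -0.01
        best_next = 0.0 if s2 == N_STATES - 1 else max(Q[s2])
        # Q-learning: bootstrap with the best next action, not the one the policy will take
        Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])
        s = s2

print([max(range(2), key=lambda b: Q[s][b]) for s in range(N_STATES - 1)])
```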
Value Function Approx.
• Sometimes we have a large-scale RL problem
1. TD backgammon (Appendix)
2. Go (DeepMind)
3. Helicopter control (continuous)
• Our objectives are still control & prediction, but we have a huge
number of states.
• The tabular solutions that we presented are not scalable.
• Value function approximation will allow us to use models!!!
Value Function (Cont)
• Consider a large (continuous) MDP
Vπ(s) ≈ V′π(s, w)
Qπ(s,a) ≈ Q′π(s, a, w),   w - set of function parameters
• We can train them by both TD & MC.
• We can generalize the values to unseen states
Type of Approximations
1. Linear Combinations
2. Neural networks (lead to DQN)
3. Wavelet solutions
Function Approximation - Techniques
• Define a feature vector X(S) for the state S, e.g.
Distance from target
Trend in a stock
Chess board configuration
• Training methods for W
• SGD
Linear functions take the form: V′π(S, W) = <X(S), W>
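A sketch of linear value-function approximation trained with semi-gradient TD(0): V′(s, w) = <X(s), w> and w is updated by SGD toward the TD target. The feature map X(s) and the made-up 1-D corridor environment are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, GAMMA, ALPHA = 10, 0.95, 0.05

def features(s):
    """Hypothetical feature vector X(s): normalized position plus a bias term."""
    return np.array([s / (N_STATES - 1), 1.0])

w = np.zeros(2)                       # V'(s, w) = <X(s), w>

for episode in range(5_000):
    s = 0
    while s != N_STATES - 1:
        s2 = min(s + rng.integers(1, 3), N_STATES - 1)    # drift right (made-up dynamics)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        v_s = features(s) @ w
        v_s2 = 0.0 if s2 == N_STATES - 1 else features(s2) @ w
        td_error = r + GAMMA * v_s2 - v_s
        w += ALPHA * td_error * features(s)               # semi-gradient SGD step on w
        s = s2

print(np.round([features(s) @ w for s in range(N_STATES)], 2))
```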
RL-Based Problems
• No supervisor, only rewards; the solutions become:
Deep RL
Why use Deep RL?
• It allows us to find an optimal model (value/policy)
• It allows us to optimize a model
• Commonly we will use SGD
Examples
• Autonomous cars
• Atari
• DeepMind
• TD-Gammon
Q-Network
• We follow the value function approximation approach:
Q(s, a, w) ≈ Q*(s, a)
Q-Learning
• We simply follow the TD target in a supervised manner:
Target
r + γ maxa′ Q(s′, a′, w)
Loss - MSE
(r + γ maxa′ Q(s′, a′, w) − Q(s, a, w))²
• We solve it with SGD
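A sketch of the target and MSE loss above; to stay dependency-free the “network” is just a linear function Q(s, a, w) = <x(s, a), w>, and the transitions are synthetic, so this illustrates the update rule rather than a real DQN.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, DIM = 2, 4

def x(s, a):
    """Hypothetical state-action features; a real DQN would use a neural net instead."""
    v = np.zeros(DIM)
    v[a * 2] = s
    v[a * 2 + 1] = 1.0
    return v

def q(s, a, w):
    return x(s, a) @ w

w = rng.normal(size=DIM) * 0.1
gamma, lr = 0.9, 0.01

for step in range(5_000):
    s, a = rng.random(), rng.integers(N_ACTIONS)     # synthetic transition (s, a, r, s')
    r, s2 = (1.0 if a == 1 else 0.0), rng.random()
    target = r + gamma * max(q(s2, b, w) for b in range(N_ACTIONS))   # TD target
    error = target - q(s, a, w)                      # gradient of the MSE loss w.r.t. w
    w += lr * error * x(s, a)                        # SGD step (target treated as constant)

print(round(q(0.5, 0, w), 2), round(q(0.5, 1, w), 2))   # action 1 should look better
```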
Q-Network - Stability Issues
Divergences
• Correlation between successive samples (non-i.i.d. data)
• The policy is not necessarily stationary (it influences the Q value)
• The scale of rewards and Q values is unknown
Deep Q-Network
Experience Replay
Replay data from the past with the current w.
It allows us to remove correlation in the data:
• Pick at with an ε-greedy algorithm
• Store the tuple (st, at, rt+1, st+1) in a replay memory
• Sample from the memory and calculate the MSE
Experience Replay
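A minimal sketch of an experience-replay buffer: transitions are stored as (st, at, rt+1, st+1, done) tuples and random minibatches are sampled later, which breaks the correlation between successive samples. The capacity, batch size and synthetic transitions are arbitrary choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) tuples and samples uncorrelated minibatches."""

    def __init__(self, capacity=10_000, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: fill with synthetic transitions, then sample a batch for the MSE update.
buf = ReplayBuffer()
for t in range(1_000):
    buf.push(s=t, a=t % 2, r=1.0, s_next=t + 1, done=False)
batch = buf.sample(4)
print(batch)
```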
DQN (Cont)
Fixed Target Q-Network
In order to handle oscillations
we calculate targets with respect to old parameters w⁻:
r + γ maxa′ Q(s′, a′, w⁻)
The loss becomes
(r + γ maxa′ Q(s′, a′, w⁻) − Q(s, a, w))²
and we periodically set w⁻ ← w
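A sketch of the fixed-target idea using the same linear stand-in for the Q-network as before: targets are computed with frozen parameters w⁻, and w⁻ is copied from w only every few hundred steps. The synchronization period and synthetic data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ACTIONS, gamma, lr, SYNC_EVERY = 4, 2, 0.9, 0.01, 200

def x(s, a):                      # hypothetical state-action features (stand-in for a net)
    v = np.zeros(DIM)
    v[a * 2] = s
    v[a * 2 + 1] = 1.0
    return v

w = rng.normal(size=DIM) * 0.1    # online parameters
w_minus = w.copy()                # frozen target parameters w^-

for step in range(5_000):
    s, a = rng.random(), rng.integers(N_ACTIONS)         # synthetic transition
    r, s2 = (1.0 if a == 1 else 0.0), rng.random()
    # Target uses w^- : r + gamma * max_a' Q(s', a', w^-)
    target = r + gamma * max(x(s2, b) @ w_minus for b in range(N_ACTIONS))
    w += lr * (target - x(s, a) @ w) * x(s, a)           # SGD on the MSE loss w.r.t. w only
    if step % SYNC_EVERY == 0:
        w_minus = w.copy()                               # w^- <- w

print(np.round(w, 2))
```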
DQN –Summary
Many further methods:
• RewardValue
• Double DQN
• Parallel Updates
Requires another lecture
Policy Gradient
• We have discussed:
1. Function approximations
2. Algorithms in which the policy is learned through the value functions
We can parametrize the policy using parameters θ:
πθ(s, a) = P[a | s, θ]
Remark: we focus on model-free methods!!
Policy Based - Good & Bad
Good
Better in high-dimensional problems
Faster convergence
Bad
Less efficient due to high variance
Can get stuck in local minima
Example: Rock-Paper-Scissors
How to optimize a policy?
• We assume it is differentiable and work with the log-likelihood
• We further assume a Gibbs (softmax) distribution, i.e.
the policy is exponential in the state-action features:
πθ(s, a) ∝ e^(θᵀΦ(s,a))
Differentiating with respect to θ gives the score function:
∇θ log πθ(s, a) = Φ(s, a) − Σb πθ(s, b) Φ(s, b)
We can also use a Gaussian policy
Optimize Policy (Cont.)
Actor-Critic
Critic - updates the action-value function parameters w
Actor - updates the policy parameters θ in the direction suggested by the critic
• Rather than learning value functions we learn action probabilities. Let At be the action
that is taken at time t:
Pr(At = a) = πt(a) = e^(Ht(a)) / Σb=1..k e^(Ht(b))
H - numerical preference
We assume a Gibbs/Boltzmann distribution
R̄t - the average reward until time t
Rt - the reward at time t
Ht+1(At) = Ht(At) + α (Rt − R̄t)(1 − πt(At))
Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a)   ∀a ≠ At
Gradient Bandit algorithm
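A sketch of the gradient-bandit updates above: a softmax over the preferences H, a running average-reward baseline R̄t, and the two update rules for the chosen and unchosen actions. The arm payouts and α are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.8, 0.5])   # hypothetical average reward of each arm
k, alpha = len(true_means), 0.1

H = np.zeros(k)          # numerical preferences
r_bar, t = 0.0, 0        # running average reward (the baseline R̄_t)

for step in range(5_000):
    pi = np.exp(H) / np.exp(H).sum()           # softmax (Gibbs/Boltzmann) over preferences
    a = rng.choice(k, p=pi)
    r = rng.normal(true_means[a], 1.0)
    t += 1
    r_bar += (r - r_bar) / t                   # incremental average reward
    # Gradient-bandit updates:
    H[a] += alpha * (r - r_bar) * (1 - pi[a])          # chosen action
    other = np.arange(k) != a
    H[other] -= alpha * (r - r_bar) * pi[other]        # all other actions

print(np.round(np.exp(H) / np.exp(H).sum(), 2))        # probability mass on the best arm
```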
Further Reading
• Sutton & Barto
http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
• Pole balancing - https://www.youtube.com/watch?v=Lt-KLtkDlh8
• DeepMind papers
• David Silver - YouTube lectures and ucl.ac.uk
• TD-Backgammon
Thank you