Reinforcement Learning
Learning Through Interaction
Sutton:
“When an infant plays, waves its arms, or looks about, it has no explicit
teacher, but it does have a direct sensorimotor connection to its
environment. Exercising this connection produces a wealth of
information about cause and effect, about the consequences of
actions, and about what to do in order to achieve goals”
• Reinforcement learning is a computational approach for this type of
learning. It adopts AI perspective to model learning through
interaction.
• A single agent interacts with the system by taking an
action. Upon this action it receives a reward and moves to
the next state. Online learning becomes plausible.
Reinforcement Learning
Reinforcement Objective
• Learning the relation between the current situation (state) and the
action to be taken in order to optimize a “payment”
Predicting the expected future reward given the current state (s):
1. Which actions should we take in order to maximize our gain?
2. Which actions should we take in order to maximize the click rate?
• The action that is taken influences the next state (a “closed loop”)
• The learner has to discover which action to take (in ML terminology
we can write the feature vector such that some features are functions of
others)
RL- Elements
• State (s) - The place where the agent is right now.
Examples:
1. A position on a chess board
2. A potential customer on a sales website
• Action (a) - An action that the agent can take while in a state.
Examples:
1. A knight or pawn captures a bishop
2. The user buys a ticket
• Reward (r) - The reward that is obtained due to the action
Examples:
1. A better or worse position
2. More money or more clicks
Basic Elements (Cont)
• Policy (π) - The “strategy” by which the agent decides which action to take.
Abstractly speaking, the policy is simply a probability function over actions that is defined for
each state
• Episode – A sequence of states and their actions
• Vπ(s) - The value function of a state s when using policy π. Mostly it is the
expected reward (e.g. in chess, the expected final outcome of the game if we
follow a given strategy)
• V(s) - Similar to Vπ(s) without a fixed policy (the expected reward over all
possible trajectories starting from s)
• Q(s,a) - The analog of V(s) on the state-action plane: the value function for state s and action a
Examples
• Tic Tac Toe
GridWorld (0,-1,10,5)
• We wish to find the best slot machine (best = max reward).
Strategy
Play! ... and find the machine with the biggest reward (on average)
• At the beginning we pick each slot machine randomly
• After several steps we gain some knowledge
How do we choose which machine to play?
1. Should we always use the best machine?
2. Should we pick one randomly?
3. Any other mechanism?
Slot Machines n-armed bandit
• The common trade-off
1. Always play the best machine so far - Exploitation
We may miss better machines due to statistical “noise”
2. Choose a machine randomly - Exploration
We don’t take the machine that currently looks optimal
Epsilon Greedy
We exploit with probability (1 - ε) and explore with probability ε
Typically ε = 0.1
Exploration, Exploitation & Epsilon Greedy
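To make the ε-greedy trade-off concrete, here is a minimal sketch of an ε-greedy n-armed bandit; the machine payouts in `true_means`, the Gaussian reward noise and ε = 0.1 are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]      # hypothetical average payout of each slot machine
n_arms = len(true_means)
q_est = np.zeros(n_arms)          # running estimate of each machine's mean reward
counts = np.zeros(n_arms)
epsilon = 0.1                     # explore with probability epsilon

for step in range(10_000):
    if rng.random() < epsilon:                # exploration: pick a random machine
        a = rng.integers(n_arms)
    else:                                     # exploitation: best machine so far
        a = int(np.argmax(q_est))
    reward = rng.normal(true_means[a], 1.0)   # pull the arm (assumed Gaussian payout)
    counts[a] += 1
    q_est[a] += (reward - q_est[a]) / counts[a]   # incremental mean update

print("estimated means:", np.round(q_est, 2))     # should approach true_means
```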
• Some problems (like the n-armed bandit) are “next best action” problems:
1. A single given state
2. A set of options that are associated with this state
3. A reward for each action
• Sometimes we wish to learn journeys
Examples:
1. Teach a robot to go from point A to point B
2. Find the fastest way to drive home
Episodes
• Episode
1. A “time series” of states {S1, S2, S3, ..., SK}
2. For every state Si there is a set of options {O1, O2, ..., Oki}
3. The learning formula (the “gradient”) depends not only on the immediate
reward but on the next state as well
Episode (Cont.)
• The observed sequence:
St, At, Rt+1, St+1, At+1, Rt+2, ..., ST, AT, RT+1   (S - state, A - action, R - reward)
• We optimize our goal function (commonly maximizing the average):
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... + γˡRt+l+1,   0 < γ ≤ 1 - the discount (“aging”) factor
Classical Example
The Pole Balancing
Find the force to apply
in order to keep the pole up.
The reward is 1 for every time step in which
the pole didn’t fall.
Reinforcement Learning – Foundation
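As a quick illustration of the return Gt, here is a small sketch that computes the discounted sum of a pole-balancing-style reward sequence; the rewards and γ = 0.9 are made-up values.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Pole-balancing style rewards: 1 per time step the pole stayed up (5 steps here).
print(discounted_return([1, 1, 1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 + ... = 4.0951
```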
Markov Property
Pr{ St+1 = s’, Rt+1 = r | S0, A0, R1, . . . , St-1, At-1, Rt , St , At }= Pr{ St+1 = s’, Rt+1 = r | St , At }
i.e. : The current state captures the entire history
• Markov processes are fully determined by the transition matrix P
Markov Process (or Markov Chain)
A tuple <S,P> where
S - set of states (mostly finite),
P a state transition probability matrix. Namely: Pss’= P [St+1 = s’ | St = s]
Markov Decision Process -MDP
A Markov Reward Process -MRP (Markov Chain with Values)
A tuple < S,P, R, γ>
S ,P as in Markov process,
R a reward function Rs = E [Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1] (as in Gt )
State Value Function for MRP:
v(s) = E [Gt | St = s]
MDP-cont
Bellman Eq.
• v(s) = E[Gt | St = s] = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s] =
E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s] = E[Rt+1 + γGt+1 | St = s]
We get a recursion rule:
v(s) = E[Rt+1 + γ v(St+1) | St = s]
Similarly we can define a value on the state-action space:
Q(s,a) = E[Gt | St = s, At = a]
MDP - MRP with a finite set of actions A
MDP-cont
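To make the MRP value function concrete, here is a sketch that solves the Bellman equation v = R + γPv exactly for a small, made-up two-state chain; the transition matrix, rewards and γ are illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-state Markov Reward Process.
P = np.array([[0.9, 0.1],      # transition matrix P[s, s']
              [0.2, 0.8]])
R = np.array([1.0, -1.0])      # expected immediate reward R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma*P) v = R
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)   # state values of the MRP
```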
• Recall - policy π is the strategy: it maps states to actions.
π(a|s) = P[At = a | St = s]
We assume that for each time t and state s, π(·|St) is fixed (π is stationary)
Clearly, for an MDP a given policy π modifies the MDP:
R -> Rπ, P -> Pπ
We modify V & Q accordingly:
Vπ(s) = Eπ[Gt | St = s]
Qπ(s,a) = Eπ[Gt | St = s, At = a]
Policy
• For V (or Q) the optimal value function v*, for each state s:
v*(s) = maxπ vπ(s),   π - policy
Solving the MDP ≡ Finding the optimal value function!!!!
Optimal Policy
π ≥ π’ if vπ(s) ≥ v π’(s) ∀s
Theorem
For every MDP there exists an optimal policy
Optimal Value Function
• If we know q*(s,a) we can find the optimal policy by acting greedily: π*(s) = argmaxa q*(s,a)
Optimal Value (Cont)
• Dynamic programming
• Monte Carlo
• TD methods
Objectives
Prediction - Find the value function of a given policy
Control – Find the optimal policy
Solution Methods
• A class of algorithms used in many applications such as graph theory
(shortest path) and bioinformatics. It relies on two essential properties:
1. The problem can be decomposed into sub-problems
2. Sub-solutions can be cached and reused
An RL MDP satisfies both
• We assume a full knowledge of MDP !!!
Prediction
Input: MDP and policy
Output: Optimal Value function vπ
Control
Input: MDP
Output: Optimal Value function v* Optimal policy π *
Dynamic Programming
• Given a policy π and an MDP, we wish to find Vπ(s)
Vπ(s) = Eπ[Rt+1 + γVπ(St+1) | St = s]
• Since the policy and the MDP are known, this is a system of linear equations in the v(si),
but…. extremely tedious!!!! Let’s do something iterative (Sutton & Barto)
Prediction – Optimal Value Function
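A minimal sketch of the iterative evaluation suggested above, applying Vπ(s) = Eπ[Rt+1 + γVπ(St+1) | St = s] as a sweep until convergence; the policy-induced transition matrix and rewards describe a made-up 4-state chain, not an example from the slides.

```python
import numpy as np

# Policy-induced transition matrix P_pi and reward vector R_pi for a made-up chain.
P_pi = np.array([[0.0, 1.0, 0.0, 0.0],
                 [0.5, 0.0, 0.5, 0.0],
                 [0.0, 0.5, 0.0, 0.5],
                 [0.0, 0.0, 0.0, 1.0]])   # last state is terminal (absorbing)
R_pi = np.array([-1.0, -1.0, -1.0, 0.0])
gamma = 1.0

v = np.zeros(4)
for sweep in range(1000):
    # V_pi(s) = E_pi[ R_{t+1} + gamma * v(S_{t+1}) | S_t = s ]
    v_new = R_pi + gamma * P_pi @ v
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(np.round(v, 3))
```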
• Following the previous algorithm, one can use an improvement step (often a greedy one) on the evaluated
value function to improve the policy, which leads to the optimal policy and value function
Policy Improvement (policy iteration)
• Policy iteration requires repeated policy updates, which can be heavy.
• Instead we can study V* directly and obtain the policy from it.
• The idea is the Bellman optimality equation: V*(s) = maxa E[Rt+1 + γV*(St+1) | St = s, At = a]
• Hence we can find V* iteratively (and derive the optimal policy)
Value Iteration
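A sketch of value iteration on a tiny made-up MDP: each sweep applies the Bellman optimality backup and the greedy policy is read off from the resulting values; all transition probabilities and rewards are illustrative assumptions.

```python
import numpy as np

# Made-up MDP: 3 states, 2 actions. P[a, s, s'] and R[a, s] are illustrative.
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.8, 0.2],
               [0.0, 0.0, 1.0]],     # action 0
              [[0.2, 0.8, 0.0],
               [0.0, 0.2, 0.8],
               [0.0, 0.0, 1.0]]])    # action 1
R = np.array([[0.0, 0.0, 0.0],
              [-0.1, -0.1, 1.0]])    # R[a, s]: action 1 costs a bit, state 2 pays off
gamma = 0.9

V = np.zeros(3)
for sweep in range(500):
    Q = R + gamma * (P @ V)          # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
    V_new = Q.max(axis=0)            # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)            # greedy policy derived from V*
print(np.round(V, 3), policy)
```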
• The formula supports online updates
• Bootstrapping
• Mostly we don’t have the MDP model
DP -Remarks
• A model-free method (we don’t need the MDP)
1. It learns from generated episodes.
2. It must complete an episode to obtain the required average.
3. It is unbiased
• For a policy π
S0, A0, R1, ..., St ~ π
We use the empirical mean return rather than the expected return.
V(St) ← V(St) + (1 / N(St)) [Gt − V(St)],   N(St) - number of visits to St
For non-stationary cases we update differently:
V(St) ← V(St) + α [Gt − V(St)]
In MC one must terminate the episode to get the value (we calculate the mean
explicitly). Hence in grid problems it may work badly.
Monte Carlo Methods
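A sketch of (every-visit) Monte Carlo prediction on a made-up 1-D random walk with a random policy: episodes must terminate before the return Gt can be averaged into V, exactly as noted above. The environment and episode count are assumptions for illustration.

```python
import random
from collections import defaultdict

N_STATES = 5            # states 0..4; 0 and 4 are terminal (made-up random walk)
GAMMA = 1.0
rng = random.Random(0)

def run_episode():
    """Random-policy episode: step left/right, reward 1 only on reaching state 4."""
    s = 2
    traj = []
    while s not in (0, N_STATES - 1):
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        traj.append((s, r))
        s = s_next
    return traj

V = defaultdict(float)
N = defaultdict(int)
for _ in range(20_000):
    traj = run_episode()
    G = 0.0
    for s, r in reversed(traj):          # compute returns backwards through the episode
        G = r + GAMMA * G
        N[s] += 1
        V[s] += (G - V[s]) / N[s]        # incremental mean: V <- V + (G - V)/N
print({s: round(V[s], 2) for s in sorted(V)})   # ~ probability of finishing on the right
```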
• Learn the optimal policy (using Q function):
Monte Carlo Control
Temporal Difference - TD
• Motivation - Combining DP & MC
As in MC - learning from experience, no explicit MDP
As in DP - bootstrapping, no need to complete the episodes
Prediction
Recall that for MC we have V(St) ← V(St) + α [Gt − V(St)],
where Gt is known only at the end of the episode.
TD Methods (Cont.)
• The TD method only needs to wait until the next step (TD(0)):
V(St) ← V(St) + α [Rt+1 + γ V(St+1) − V(St)]
We can see that this leads to different targets:
MC - Gt
TD - Rt+1 + γ V(St+1)
• Hence it is a bootstrapping method
The estimation of V given a policy is straightforward since the policy
chooses St+1.
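A matching TD(0) sketch on the same kind of random walk: V is updated toward the bootstrapped target Rt+1 + γV(St+1) at every step, without waiting for the episode to end. The task, step size α and episode count are illustrative.

```python
import random

N_STATES = 5                 # 0 and 4 terminal, as in the MC sketch above
GAMMA, ALPHA = 1.0, 0.05
rng = random.Random(0)
V = [0.0] * N_STATES         # V of the terminal states stays 0

for episode in range(20_000):
    s = 2
    while s not in (0, N_STATES - 1):
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # TD(0): V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V[1:-1]])   # approximately [0.25, 0.5, 0.75]
```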
Driving Home Problem
TD vs. MC - Summary
MC
• High variance, unbiased
• Good convergence
• Easy to understand
• Low sensitivity to initial conditions
TD
• Efficiency
• Convergence to Vπ
• More sensitive to initial conditions
SARSA
• An on-policy method for Q-learning (update after every step):
Q(St, At) ← Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) − Q(St, At)]
The next step is to use SARSA to develop a control algorithm: we
learn the Q function on-policy and update the policy toward
greediness (sketched below).
On Policy Control Algorithm
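A hedged sketch of an on-policy (SARSA) control loop on a made-up 1-D chain: actions are chosen ε-greedily from Q, and Q is updated with the on-policy target Rt+1 + γQ(St+1, At+1). The environment, step cost and hyper-parameters are assumptions.

```python
import random

N_STATES, ACTIONS = 6, (-1, +1)          # made-up 1-D chain; reach state 5 for reward 1
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1
rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def eps_greedy(s):
    if rng.random() < EPS:
        return rng.randrange(2)
    return max(range(2), key=lambda a: Q[s][a])

def step(s, a):
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else -0.01     # small step cost, big terminal reward
    return s2, r, s2 == N_STATES - 1

for episode in range(2_000):
    s, a = 0, eps_greedy(0)
    done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = eps_greedy(s2)                      # the action actually taken next (on-policy)
        target = r + (0.0 if done else GAMMA * Q[s2][a2])
        Q[s][a] += ALPHA * (target - Q[s][a])    # SARSA update
        s, a = s2, a2

print([max(range(2), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)])  # greedy policy
```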
Example Windy Grid-World
Q-learning - Off Policy
• Rather than bootstrapping from the action the policy actually takes next, we simply use
the best action for the next state:
Q(St, At) ← Q(St, At) + α [Rt+1 + γ maxa Q(St+1, a) − Q(St, At)]
The control algorithm is straightforward
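For comparison, a sketch of the off-policy Q-learning update on the same kind of made-up chain: behaviour is still ε-greedy, but the bootstrap uses maxa Q(St+1, a) rather than the action the policy actually takes next.

```python
import random

N_STATES, GAMMA, ALPHA, EPS = 6, 0.95, 0.1, 0.1
rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # actions: 0 = left, 1 = right

for episode in range(2_000):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy behaviour policy
        a = rng.randrange(2) if rng.random() < EPS else max(range(2), key=lambda b: Q[s][b])
        s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else -0.01
        best_next = 0.0 if s2 == N_STATES - 1 else max(Q[s2])
        # Q-learning: bootstrap with the best next action, not the one the policy will take
        Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])
        s = s2

print([max(range(2), key=lambda b: Q[s][b]) for s in range(N_STATES - 1)])
```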
Value Function Approx.
• Sometimes we have a large-scale RL problem
1. TD backgammon (Appendix)
2. Go (DeepMind)
3. Helicopter control (continuous)
• Our objectives are still control & prediction, but we have a huge
number of states.
• The tabular solutions that we presented are not scalable.
• Value function approximation will allow us to use models!!!
Value Function (Cont)
• Consider a large (continuous) MDP
Vπ(s) ≈ V′π(s, w)
Qπ(s,a) ≈ Q′π(s, a, w),   w - set of function parameters
• We can train them by both TD & MC.
• We can generalize the values to unseen states
Type of Approximations
1. Linear Combinations
2. Neural networks (lead to DQN)
3. Wavelet solutions
Function Approximation - Techniques
• Define a feature vector X(S) for the state S, e.g.
Distance from target
Trend in a stock
Chess board configuration
• Training methods for W
• SGD
Linear functions take the form: V′π(S, W) = <X(S), W>
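A sketch of linear value-function approximation trained with semi-gradient TD(0): V′(s, w) = <X(s), w> and w is updated by SGD toward the TD target. The feature map X(s) and the made-up 1-D corridor environment are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, GAMMA, ALPHA = 10, 0.95, 0.05

def features(s):
    """Hypothetical feature vector X(s): normalized position plus a bias term."""
    return np.array([s / (N_STATES - 1), 1.0])

w = np.zeros(2)                       # V'(s, w) = <X(s), w>

for episode in range(5_000):
    s = 0
    while s != N_STATES - 1:
        s2 = min(s + rng.integers(1, 3), N_STATES - 1)    # drift right (made-up dynamics)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        v_s = features(s) @ w
        v_s2 = 0.0 if s2 == N_STATES - 1 else features(s2) @ w
        td_error = r + GAMMA * v_s2 - v_s
        w += ALPHA * td_error * features(s)               # semi-gradient SGD step on w
        s = s2

print(np.round([features(s) @ w for s in range(N_STATES)], 2))
```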
RL-Based Problems
• No supervisor, only rewards; the solutions become:
Deep RL
Why use Deep RL?
• It allows us to find an optimal model (value/policy)
• It allows us to optimize a model
• Commonly we will use SGD
Examples
• Autonomous cars
• Atari
• DeepMind
• TD-Gammon
Q-Network
• We follow the value function approximation approach:
Q(s, a, w) ≈ Q*(s, a)
Q-Learning
• We simply follow the TD target in a supervised manner:
Target
r + γ maxa′ Q(s′, a′, w)
Loss - MSE
(r + γ maxa′ Q(s′, a′, w) − Q(s, a, w))²
• We solve it with SGD
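A sketch of the target and MSE loss above; to stay dependency-free the “network” is just a linear function Q(s, a, w) = <x(s, a), w>, and the transitions are synthetic, so this illustrates the update rule rather than a real DQN.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, DIM = 2, 4

def x(s, a):
    """Hypothetical state-action features; a real DQN would use a neural net instead."""
    v = np.zeros(DIM)
    v[a * 2] = s
    v[a * 2 + 1] = 1.0
    return v

def q(s, a, w):
    return x(s, a) @ w

w = rng.normal(size=DIM) * 0.1
gamma, lr = 0.9, 0.01

for step in range(5_000):
    s, a = rng.random(), rng.integers(N_ACTIONS)     # synthetic transition (s, a, r, s')
    r, s2 = (1.0 if a == 1 else 0.0), rng.random()
    target = r + gamma * max(q(s2, b, w) for b in range(N_ACTIONS))   # TD target
    error = target - q(s, a, w)                      # gradient of the MSE loss w.r.t. w
    w += lr * error * x(s, a)                        # SGD step (target treated as constant)

print(round(q(0.5, 0, w), 2), round(q(0.5, 1, w), 2))   # action 1 should look better
```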
Q-Network - Stability Issues
Divergences
• Correlation between successive samples (non-i.i.d. data)
• The policy is not necessarily stationary (it influences the Q value)
• The scale of rewards and Q values is unknown
Deep Q-Network
Experience Replay
Replay data from the past with the current w.
It allows us to remove correlation in the data:
• Pick at with an ε-greedy algorithm
• Store the tuple (st, at, rt+1, st+1) in a replay memory
• Sample from the memory and calculate the MSE
Experience Replay
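A minimal sketch of an experience-replay buffer: transitions are stored as (st, at, rt+1, st+1, done) tuples and random minibatches are sampled later, which breaks the correlation between successive samples. The capacity, batch size and synthetic transitions are arbitrary choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) tuples and samples uncorrelated minibatches."""

    def __init__(self, capacity=10_000, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: fill with synthetic transitions, then sample a batch for the MSE update.
buf = ReplayBuffer()
for t in range(1_000):
    buf.push(s=t, a=t % 2, r=1.0, s_next=t + 1, done=False)
batch = buf.sample(4)
print(batch)
```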
DQN (Cont)
Fixed Target Q-Network
In order to handle oscillations
we calculate targets with respect to old parameters w⁻:
r + γ maxa′ Q(s′, a′, w⁻)
The loss becomes
(r + γ maxa′ Q(s′, a′, w⁻) − Q(s, a, w))²
and we periodically set w⁻ ← w
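A sketch of the fixed-target idea using the same linear stand-in for the Q-network as before: targets are computed with frozen parameters w⁻, and w⁻ is copied from w only every few hundred steps. The synchronization period and synthetic data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ACTIONS, gamma, lr, SYNC_EVERY = 4, 2, 0.9, 0.01, 200

def x(s, a):                      # hypothetical state-action features (stand-in for a net)
    v = np.zeros(DIM)
    v[a * 2] = s
    v[a * 2 + 1] = 1.0
    return v

w = rng.normal(size=DIM) * 0.1    # online parameters
w_minus = w.copy()                # frozen target parameters w^-

for step in range(5_000):
    s, a = rng.random(), rng.integers(N_ACTIONS)         # synthetic transition
    r, s2 = (1.0 if a == 1 else 0.0), rng.random()
    # Target uses w^- : r + gamma * max_a' Q(s', a', w^-)
    target = r + gamma * max(x(s2, b) @ w_minus for b in range(N_ACTIONS))
    w += lr * (target - x(s, a) @ w) * x(s, a)           # SGD on the MSE loss w.r.t. w only
    if step % SYNC_EVERY == 0:
        w_minus = w.copy()                               # w^- <- w

print(np.round(w, 2))
```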
DQN –Summary
Many further methods:
• RewardValue
• Double DQN
• Parallel Updates
Requires another lecture
Policy Gradient
• We have discussed:
1. Function approximations
2. Algorithms in which the policy is learned through the value functions
We can parametrize the policy using parameters θ:
πθ(s, a) = P[a | s, θ]
Remark: we focus on model-free methods!!
Policy Based - Good & Bad
Good
Better in high-dimensional problems
Faster convergence
Bad
Less efficient due to high variance
Can get stuck in local minima
Example: Rock-Paper-Scissors
How to optimize a policy?
• We assume it is differentiable and work with the log-likelihood
• We further assume a Gibbs (softmax) distribution, i.e.
the policy is exponential in the state-action features:
πθ(s, a) ∝ e^(θᵀΦ(s,a))
Differentiating with respect to θ gives the score function:
∇θ log πθ(s, a) = Φ(s, a) − Σb πθ(s, b) Φ(s, b)
We can also use a Gaussian policy
Optimize Policy (Cont.)
Actor-Critic
Critic - updates the action-value function parameters w
Actor - updates the policy parameters θ in the direction suggested by the critic
• Rather than learning value functions we learn action probabilities. Let At be the action
that is taken at time t:
Pr(At = a) = πt(a) = e^(Ht(a)) / Σb=1..k e^(Ht(b))
H - numerical preference
We assume a Gibbs/Boltzmann distribution
R̄t - the average reward until time t
Rt - the reward at time t
Ht+1(At) = Ht(At) + α (Rt − R̄t)(1 − πt(At))
Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a)   ∀a ≠ At
Gradient Bandit algorithm
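A sketch of the gradient-bandit updates above: a softmax over the preferences H, a running average-reward baseline R̄t, and the two update rules for the chosen and unchosen actions. The arm payouts and α are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.8, 0.5])   # hypothetical average reward of each arm
k, alpha = len(true_means), 0.1

H = np.zeros(k)          # numerical preferences
r_bar, t = 0.0, 0        # running average reward (the baseline R̄_t)

for step in range(5_000):
    pi = np.exp(H) / np.exp(H).sum()           # softmax (Gibbs/Boltzmann) over preferences
    a = rng.choice(k, p=pi)
    r = rng.normal(true_means[a], 1.0)
    t += 1
    r_bar += (r - r_bar) / t                   # incremental average reward
    # Gradient-bandit updates:
    H[a] += alpha * (r - r_bar) * (1 - pi[a])          # chosen action
    other = np.arange(k) != a
    H[other] -= alpha * (r - r_bar) * pi[other]        # all other actions

print(np.round(np.exp(H) / np.exp(H).sum(), 2))        # probability mass on the best arm
```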
Further Reading
• Sutton & Barto
http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
• Pole balancing - https://www.youtube.com/watch?v=Lt-KLtkDlh8
• DeepMind papers
• David Silver - YouTube lectures and ucl.ac.uk
• TD-Backgammon
Thank you