This document summarizes the policy gradient reinforcement learning algorithm. It begins by introducing the objective of directly maximizing expected reward over a policy. It then derives the policy gradient theorem, which allows calculating the analytical gradient of the expected reward with respect to the policy parameters. This is used to develop the REINFORCE algorithm, which approximates the policy gradient using sampled episodes. REINFORCE estimates state-action values to compute the policy gradient and updates the policy in the direction of increasing expected reward. Baseline functions can be subtracted from the state-action values to reduce variance in the policy gradient estimate.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
A review of the basic ideas and concepts in reinforcement learning, including discussion of Q-learning and SARSA methods. Includes a survey of modern RL methods, including Dyna-Q, DQN, REINFORCE, and A2C, and how they relate.
2. These slides are almost an exact copy of the Practical RL course week 6 slides.
Special thanks to YSDA team for making them publicly available.
Original slides link: week06_policy_based
References
10. Approximation error
DQN is trained to minimize
L ≈ E[(Q(st,at) − (rt + γ·maxa' Q(st+1,a')))²]
Simple 2-state world:
            True   (A)    (B)
Q(s0,a0)      1     1      2
Q(s0,a1)      2     2      1
Q(s1,a0)      3     3      3
Q(s1,a1)    100    50    100
(A) gives the better policy; (B) gives the smaller MSE.
Q-learning will prefer the worse policy (B)!
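A quick numeric check of the table (only the table values come from the slide; the code itself is an illustrative sketch):

```python
import numpy as np

# True Q-values and approximations (A), (B) from the table above;
# rows are states s0, s1 and columns are actions a0, a1.
q_true = np.array([[1., 2.], [3., 100.]])
q_a    = np.array([[1., 2.], [3., 50.]])    # right ordering, big value error
q_b    = np.array([[2., 1.], [3., 100.]])   # wrong ordering in s0, tiny error

for name, q in [("A", q_a), ("B", q_b)]:
    mse = np.mean((q - q_true) ** 2)        # what the DQN loss cares about
    greedy = q.argmax(axis=1)               # what the agent actually does
    print(name, "MSE:", mse, "greedy actions:", greedy)
# A: MSE 625.0, greedy [1 1] (optimal); B: MSE 0.5, greedy [0 1] (suboptimal)
```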
11. Conclusion
● Often computing q-values is harder than picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)
Q: what algorithm works that way? (of those we studied)
A: e.g. the crossentropy method
13. NOT how humans survived
argmax[ Q(s, pet the tiger), Q(s, run from tiger), Q(s, provoke tiger), Q(s, ignore tiger) ]
15. Policies
In general, two kinds:
● Deterministic policy a = πθ(s): same action each time; used by genetic algos (week 0) and deterministic policy gradient
● Stochastic policy a ∼ πθ(a∣s): sampling takes care of exploration; used by the crossentropy method and policy gradient
Q: any case where stochastic is better? A: e.g. rock-paper-scissors
Q: how to represent the policy in a continuous action space? A: categorical, normal, mixture of normals, whatever
19. Two approaches
● Value based: learn a value function Qθ(s,a) or Vθ(s), then infer the policy: a = argmaxa Qθ(s,a)
● Policy based: explicitly learn the policy πθ(a∣s) or πθ(s)→a, implicitly maximizing reward over the policy
20. Recap: crossentropy method
● Initialize NN weights θ0 ← random
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– θi+1 = θi + α·∇ ∑i log πθ(ai∣si)·[si,ai ∈ Elite]
TD version: elite = (s,a) that have the highest G(s,a) (select elites independently for each state)
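A minimal sketch of this loop for a tabular softmax policy on a made-up one-step environment (all sizes, rewards and hyperparameters are assumptions):

```python
import numpy as np

# Crossentropy method sketch: keep the M best of N sampled sessions as the
# elite set and raise the log-probability of elite (state, action) pairs.
rng = np.random.default_rng(0)
n_states, n_actions, alpha = 4, 3, 0.5
theta = np.zeros((n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))   # toy reward table

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def play_session():
    s = rng.integers(n_states)
    a = rng.choice(n_actions, p=softmax(theta[s]))
    return [(s, a)], R[s, a]

def cem_step(N=50, M=10):
    sessions = sorted((play_session() for _ in range(N)),
                      key=lambda sr: sr[1], reverse=True)
    elite = [sa for steps, _ in sessions[:M] for sa in steps]
    for s, a in elite:                       # grad of log pi(a|s): onehot - pi
        g = -softmax(theta[s])
        g[a] += 1.0
        theta[s] += alpha * g

for _ in range(100):
    cem_step()
```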
26. Objective
Consider a 1-step process for simplicity. The objective is the reward of a 1-step session, in expectation:
J = E[s∼p(s), a∼πθ(a∣s)] R(s,a) = ∫s p(s) ∫a πθ(a∣s)·R(s,a) da ds
where p(s) is the state visitation frequency (it may depend on the policy).
Q: how do we compute that? A: sample N sessions under the current policy:
J ≈ (1/N)·∑i=0..N ∑(s,a)∈zi R(s,a)
Can we optimize the policy now? The parameters θ "sit" inside πθ(a∣s), under the sampling; we don't know how to compute dJ/dθ directly.
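A sketch of this sampling estimate, assuming a uniform p(s) and a tabular softmax policy (both made up for illustration):

```python
import numpy as np

# Monte Carlo estimate of J for a 1-step process.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))   # policy parameters
R = rng.normal(size=(n_states, n_actions))       # reward table R(s, a)

def policy(s):
    logits = theta[s]
    e = np.exp(logits - logits.max())
    return e / e.sum()

N = 10_000
J = 0.0
for _ in range(N):
    s = rng.integers(n_states)                   # s ~ p(s), uniform here
    a = rng.choice(n_actions, p=policy(s))       # a ~ pi_theta(a|s)
    J += R[s, a]
print("J estimate:", J / N)
```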
30. Optimization
● Finite differences: change the policy a little, evaluate:
∇J ≈ (J(θ+ϵ) − J(θ)) / ϵ
● Stochastic optimization: good old crossentropy method, i.e. maximize probability of "elite" actions
Q: any problems with those two?
A: finite differences are VERY noisy, especially if both J's are sampled; the crossentropy method has "quantile convergence" problems with stochastic MDPs
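A sketch of the finite-difference estimator, with estimate_J standing in for any (noisy) sampled evaluation of expected reward:

```python
import numpy as np

# Finite-difference policy gradient estimate, one coordinate at a time.
def fd_gradient(estimate_J, theta, eps=1e-2):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d.flat[i] = eps
        # both terms below are noisy samples, so the estimate is very noisy
        grad.flat[i] = (estimate_J(theta + d) - estimate_J(theta)) / eps
    return grad
```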
39. REINFORCE (bandit)
∇J ≈ (1/N)·∑i=0..N ∇ log πθ(a∣s)·R(s,a)
● Initialize NN weights θ0 ← random
● Loop:
– Sample N sessions z under current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
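A minimal sketch of this loop for a contextual bandit with a tabular softmax policy, where ∇ log πθ(a∣s) has the closed form onehot(a) − πθ(·∣s) (environment and hyperparameters are made up):

```python
import numpy as np

# REINFORCE on a toy contextual bandit.
rng = np.random.default_rng(0)
n_states, n_actions, alpha = 4, 3, 0.1
theta = np.zeros((n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))   # toy reward table

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    grad, N = np.zeros_like(theta), 32
    for _ in range(N):
        s = rng.integers(n_states)
        p = softmax(theta[s])
        a = rng.choice(n_actions, p=p)
        g = -p                               # grad log pi(a|s) = onehot(a) - pi
        g[a] += 1.0
        grad[s] += g * R[s, a]
    theta += alpha * grad / N                # ascend the estimated gradient
```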
40. Discounted reward case
● Replace R with Q :)
∇J = ∫s p(s) ∫a ∇πθ(a∣s)·Q(s,a) da ds
Log-derivative trick: π·∇ log π = ∇π, so the integral is an expectation :)
∇J = ∫s p(s) ∫a πθ(a∣s)·∇ log πθ(a∣s)·Q(s,a) da ds
Here Q(s,a) is the true action value, a.k.a. E[G(s,a)].
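The same trick written out, for reference (a standard identity, not specific to these slides):

```latex
\nabla_\theta \pi_\theta(a\mid s) = \pi_\theta(a\mid s)\,\nabla_\theta \log \pi_\theta(a\mid s)
\quad\Longrightarrow\quad
\nabla_\theta J = \mathbb{E}_{\substack{s\sim p(s)\\ a\sim\pi_\theta(a\mid s)}}
\big[\nabla_\theta \log \pi_\theta(a\mid s)\, Q(s,a)\big].
```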
41. REINFORCE (discounted)
● Policy gradient: ∇J = E[s∼p(s), a∼πθ(a∣s)] ∇ log πθ(a∣s)·Q(s,a)
● Approximate with sampling: ∇J ≈ (1/N)·∑i=0..N ∑(s,a)∈zi ∇ log πθ(a∣s)·Q(s,a)
42. REINFORCE algorithm
We can estimate Q using G:
Gt = rt + γ·rt+1 + γ²·rt+2 + ...
Qπ(st,at) = Es' [G(st,at)]
[Figure: a tree of trajectories branching from (prev s, prev a) through states s, s', s'' with actions a, a', a'' and rewards r, r', r'', r''']
43. Recap: discounted rewards
We can use this to compute all G's in linear time:
Gt = rt + γ·rt+1 + γ²·rt+2 + ... = rt + γ·(rt+1 + γ·rt+2 + ...) = rt + γ·Gt+1
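The recursion in code: a sketch that computes every Gt of an episode in one backward pass:

```python
# Discounted returns G_t for one episode via G_t = r_t + gamma * G_{t+1}.
def discounted_returns(rewards, gamma=0.99):
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```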
44. REINFORCE algorithm
∇J ≈ (1/N)·∑i=0..N ∑(s,a)∈zi ∇ log πθ(a∣s)·Q(s,a)
● Initialize NN weights θ0 ← random
● Loop:
– Sample N sessions z under current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
Q: is it off- or on-policy?
A: actions are taken under the current policy = on-policy
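A sketch of this loop in PyTorch for a discrete-action, gymnasium-style environment; the network sizes, learning rate, and env interface are assumptions rather than part of the slides:

```python
import torch
import torch.nn as nn

# CartPole-like sizes assumed: 4 observations, 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def run_episode(env):
    logps, rewards = [], []
    s, _ = env.reset()                          # gymnasium-style API
    done = False
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        logps.append(dist.log_prob(a))          # log pi(a|s), keeps the graph
        s, r, terminated, truncated, _ = env.step(a.item())
        done = terminated or truncated
        rewards.append(r)
    return logps, rewards

def reinforce_step(env, n_sessions=16):
    loss = torch.tensor(0.0)
    for _ in range(n_sessions):
        logps, rewards = run_episode(env)
        G, returns = 0.0, []
        for r in reversed(rewards):             # G_t = r_t + gamma * G_{t+1}
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for logp, G in zip(logps, returns):
            loss = loss - logp * G              # minimizing -J ascends J
    opt.zero_grad()
    (loss / n_sessions).backward()
    opt.step()
```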
47. Value-based vs policy-based
Value-based:
● Q-learning, SARSA, MCTS value-iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)
Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only (for now; we'll learn much more soon!)
49. REINFORCE baselines
∇J ≈ (1/N)·∑i=0..N ∑(s,a)∈zi ∇ log πθ(a∣s)·Q(s,a)
What is better for learning: a random action in a good state, or a great action in a bad state?
50. REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E[s∼p(s), a∼πθ(a∣s)] ∇ log πθ(a∣s)·(Q(s,a) − b(s)) =
= E[∇ log πθ(a∣s)·Q(s,a)] − E[∇ log πθ(a∣s)·b(s)]
Note that b(s) does not depend on a. Q: can you simplify the second term?
A: E[s∼p(s), a∼πθ(a∣s)] ∇ log πθ(a∣s)·b(s) = E[s∼p(s)] b(s)·E[a∼πθ(a∣s)] ∇ log πθ(a∣s) = 0,
because E[a∼πθ(a∣s)] ∇ log πθ(a∣s) = ∇ ∑a πθ(a∣s) = ∇ 1 = 0.
So ∇J = E[s∼p(s), a∼πθ(a∣s)] ∇ log πθ(a∣s)·Q(s,a): the gradient direction doesn't change!
54. REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance, treating ∇J as a random variable over (s,a):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
If b(s) correlates with Q(s,a), the variance decreases.
Q: can you suggest any such b(s)?
A: naive baseline: b = moving average of Q over all (s,a), so Var[b(s)] = 0 and Cov[Q,b] > 0
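A sketch of that naive baseline, assuming returns holds the sampled G(s,a) of one batch and using a made-up decay rate:

```python
# Moving-average baseline for REINFORCE.
baseline, beta = 0.0, 0.9

def advantages(returns):
    global baseline
    adv = [G - baseline for G in returns]          # Q(s,a) - b
    baseline = beta * baseline + (1 - beta) * (sum(returns) / len(returns))
    return adv
```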
60. Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s). Use Vθ(s) to learn πθ(a∣s) faster!
Q: how can we estimate A(s,a) from (s,a,r,s') and the V-function?
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s) (also: n-step version)
∇Jactor ≈ (1/N)·∑i=0..N ∑(s,a)∈zi ∇ log πθ(a∣s)·A(s,a)
(consider A(s,a) const: don't backpropagate through the critic in the actor update)
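A sketch of the resulting actor and critic losses in PyTorch; the tensor names and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

# policy_logits: (B, A); actions, r, done, v_s, v_next: (B,)
def a2c_losses(policy_logits, actions, v_s, v_next, r, done, gamma=0.99):
    target = r + gamma * v_next * (1 - done)         # r + gamma * V(s')
    advantage = (target - v_s).detach()              # "consider const"
    logp = F.log_softmax(policy_logits, dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(logp_a * advantage).mean()        # ascend policy gradient
    critic_loss = F.mse_loss(v_s, target.detach())   # fit V(s) to the target
    return actor_loss, critic_loss
```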
65. Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equities
Q: how does the algorithm change? A: it doesn't :) Just plug in a different formula for π(a∣s), e.g. a normal distribution.
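For instance, a sketch of a Gaussian policy head in PyTorch (sizes are made up); its log_prob plugs into the same losses as the categorical case:

```python
import torch

mu_net = torch.nn.Linear(4, 1)                   # state -> action mean
log_std = torch.nn.Parameter(torch.zeros(1))     # learned log std

def sample_action(state):
    dist = torch.distributions.Normal(mu_net(state), log_std.exp())
    a = dist.sample()
    return a, dist.log_prob(a).sum(-1)           # log pi(a|s)
```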
67. Asynchronous advantage actor-critic
● Parallel game sessions
● Async multi-CPU training
● No experience replay
● LSTM policy
● N-step advantage
Read more: https://arxiv.org/abs/1602.01783
69. Duct tape zone
● V(s) errors are less important than in Q-learning: the actor still learns even if the critic is random, just slower
● Regularize with entropy to prevent premature convergence
● Learn on parallel sessions, or a super-small experience replay
● Use logsoftmax for numerical stability
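A sketch combining the last two tips: an actor loss with an entropy bonus, computed via log_softmax (tensor names and the coefficient are assumptions):

```python
import torch
import torch.nn.functional as F

def actor_loss(logits, actions, advantage, entropy_coef=0.01):
    logp = F.log_softmax(logits, dim=-1)                  # numerically stable
    entropy = -(logp.exp() * logp).sum(-1).mean()         # policy entropy
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(logp_a * advantage).mean() - entropy_coef * entropy
```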
71. Outro and Q & A
● Remember the log-derivative trick
● Combining the best from both worlds is generally a good idea
● See this paper for the proof of the policy gradient for discounted rewards
● Time to write some code!