The document provides an overview of policy gradient algorithms. It begins by motivating policy gradients as a way to perform policy improvement with gradient ascent instead of argmax when the action space is large or continuous. It then defines key concepts like the policy gradient theorem, which shows the gradient of expected returns with respect to the policy parameters can be estimated by sampling simulation paths. It presents the policy gradient algorithm REINFORCE, which uses Monte Carlo returns to estimate the action-value function and perform policy gradient ascent updates. It also discusses canonical policy function approximations like the softmax policy for finite actions and Gaussian policy for continuous actions.
2. Overview
1 Motivation and Intuition
2 Definitions and Notation
3 Policy Gradient Theorem and Proof
4 Policy Gradient Algorithms
5 Compatible Function Approximation Theorem and Proof
6 Natural Policy Gradient
3. Why do we care about Policy Gradient (PG)?
Let us review how we got here
We started with Markov Decision Processes and Bellman Equations
Next we studied several variants of DP and RL algorithms
We noted that the idea of Generalized Policy Iteration (GPI) is key
Policy Improvement step: π(a|s) derived from argmax_a Q(s, a)
How do we do argmax when action space is large or continuous?
Idea: Do Policy Improvement step with a Gradient Ascent instead
4. “Policy Improvement with a Gradient Ascent??”
We want to find the Policy that fetches the “Best Expected Returns”
Gradient Ascent on “Expected Returns” w.r.t params of Policy func
So we need a func approx for (stochastic) Policy Func: π(s, a; θ)
In addition to the usual func approx for Action Value Func: Q(s, a; w)
π(s, a; θ) func approx called Actor, Q(s, a; w) func approx called Critic
Critic parameters w are optimized by minimizing a suitable loss function of Q(s, a; w)
Actor parameters θ are optimized by maximizing the Expected Returns
We need to formally define “Expected Returns”
But we already see that this idea is appealing for continuous actions
GPI with Policy Improvement done as Policy Gradient (Ascent)
5. Value Function-based and Policy-based RL
Value Function-based
Learn Value Function (with a function approximation)
Policy is implicit - readily derived from Value Function (eg: ε-greedy)
Policy-based
Learn Policy (with a function approximation)
No need to learn a Value Function
Actor-Critic
Learn Policy (Actor)
Learn Value Function (Critic)
6. Advantages and Disadvantages of Policy Gradient approach
Advantages:
Finds the best Stochastic Policy (Optimal Deterministic Policy,
produced by other RL algorithms, can be unsuitable for POMDPs)
Naturally explores due to Stochastic Policy representation
Effective in high-dimensional or continuous action spaces
Small changes in θ ⇒ small changes in π, and in state distribution
This avoids the convergence issues seen in argmax-based algorithms
Disadvantages:
Typically converge to a local optimum rather than a global optimum
Policy Evaluation is typically inefficient and has high variance
Policy Improvement happens in small steps ⇒ slow convergence
7. Notation
Discount Factor γ
Assume episodic with 0 ≤ γ ≤ 1, or non-episodic with 0 ≤ γ < 1
States st ∈ S, Actions at ∈ A, Rewards rt ∈ R, ∀t ∈ {0, 1, 2, . . .}
State Transition Probabilities P^a_{s,s'} = Pr(st+1 = s' | st = s, at = a)
Expected Rewards R^a_s = E[rt | st = s, at = a]
Initial State Probability Distribution p0 : S → [0, 1]
Policy Func Approx π(s, a; θ) = Pr(at = a | st = s, θ), θ ∈ R^k
PG coverage will be quite similar for non-discounted non-episodic, by
considering average-reward objective (so we won’t cover it)
8. “Expected Returns” Objective
Now we formalize the “Expected Returns” Objective J(θ)
J(θ) = Eπ[ Σ_{t=0}^∞ γ^t · rt ]
Value Function Vπ(s) and Action Value function Qπ(s, a) defined as:
Vπ(s) = Eπ[ Σ_{k=t}^∞ γ^{k−t} · rk | st = s ], ∀t ∈ {0, 1, 2, . . .}
Qπ(s, a) = Eπ[ Σ_{k=t}^∞ γ^{k−t} · rk | st = s, at = a ], ∀t ∈ {0, 1, 2, . . .}
Advantage Function Aπ(s, a) = Qπ(s, a) − Vπ(s)
Also, p(s → s', t, π) will be a key function for us - it denotes the probability of going from state s to s' in t steps by following policy π
9. Discounted-Aggregate State-Visitation Measure
J(θ) = Eπ[ Σ_{t=0}^∞ γ^t · rt ] = Σ_{t=0}^∞ γ^t · Eπ[rt]
= Σ_{t=0}^∞ γ^t · ∫S ( ∫S p0(s0) · p(s0 → s, t, π) · ds0 ) ∫A π(s, a; θ) · R^a_s · da · ds
= ∫S ( ∫S Σ_{t=0}^∞ γ^t · p0(s0) · p(s0 → s, t, π) · ds0 ) ∫A π(s, a; θ) · R^a_s · da · ds
Definition
J(θ) = ∫S ρπ(s) ∫A π(s, a; θ) · R^a_s · da · ds
where ρπ(s) = ∫S Σ_{t=0}^∞ γ^t · p0(s0) · p(s0 → s, t, π) · ds0 is the key function
(for PG) we'll refer to as Discounted-Aggregate State-Visitation Measure.
10. Policy Gradient Theorem (PGT)
Theorem
∇θJ(θ) = ∫S ρπ(s) ∫A ∇θπ(s, a; θ) · Qπ(s, a) · da · ds
Note: ρπ(s) depends on θ, but there’s no ∇θρπ(s) term in ∇θJ(θ)
So we can simply sample simulation paths, and at each time step, we
calculate (∇θ log π(s, a; θ)) · Qπ(s, a) (probabilities implicit in paths)
Note: ∇θ log π(s, a; θ) is Score function (Gradient of log-likelihood)
We will estimate Qπ(s, a) with a function approximation Q(s, a; w)
We will later show how to avoid the estimate bias of Q(s, a; w)
This numerical estimate of ∇θJ(θ) enables Policy Gradient Ascent
Let us look at the score function of some canonical π(s, a; θ)
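To make the sampling recipe concrete, here is a minimal Python sketch of turning one sampled simulation path into an estimate of ∇θJ(θ): each step contributes γ^t · ∇θ log π(st, at; θ) · Q̂(st, at). The function name, inputs, and the toy numbers below are illustrative assumptions, not from the slides.

```python
import numpy as np

def pg_estimate(score_vecs, q_estimates, gamma):
    """Estimate of grad_theta J(theta) from one sampled simulation path.

    score_vecs:  score_vecs[t] = grad_theta log pi(s_t, a_t; theta)  (array)
    q_estimates: q_estimates[t] ~ Q^pi(s_t, a_t)                     (float)
    """
    grad = np.zeros_like(score_vecs[0], dtype=float)
    for t, (score, q) in enumerate(zip(score_vecs, q_estimates)):
        grad += (gamma ** t) * score * q   # gamma^t weighting comes from rho^pi
    return grad

# toy usage with made-up numbers
scores = [np.array([0.5, -0.2]), np.array([-0.1, 0.3])]
qs = [1.2, 0.7]
print(pg_estimate(scores, qs, gamma=0.99))
```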
11. Canonical π(s, a; θ) for finite action spaces
For finite action spaces, we often use Softmax Policy
θ is an n-vector (θ1, . . . , θn)
Features vector φ(s, a) = (φ1(s, a), . . . , φn(s, a)) for all s ∈ S, a ∈ A
Weight actions using linear combinations of features: θT · φ(s, a)
Action probabilities proportional to exponentiated weights:
π(s, a; θ) = e^{θT·φ(s,a)} / Σb e^{θT·φ(s,b)}  for all s ∈ S, a ∈ A
The score function is:
∇θ log π(s, a; θ) = φ(s, a) − Σb π(s, b; θ) · φ(s, b) = φ(s, a) − Eπ[φ(s, ·)]
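A small sketch of the softmax policy and its score function for a single state; the feature matrix and parameter values below are made up for illustration.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Action probabilities for one state; phi_s[a] is the feature vector phi(s, a)."""
    prefs = phi_s @ theta                 # theta^T . phi(s, a) for each action a
    prefs = prefs - prefs.max()           # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def softmax_score(theta, phi_s, a):
    """Score: grad_theta log pi(s, a; theta) = phi(s, a) - E_pi[phi(s, .)]."""
    return phi_s[a] - softmax_policy(theta, phi_s) @ phi_s

# toy state with 3 actions and 2 features (made-up values)
theta = np.array([0.1, -0.4])
phi_s = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(softmax_policy(theta, phi_s))
print(softmax_score(theta, phi_s, a=2))
```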
12. Canonical π(s, a; θ) for continuous action spaces
For continuous action spaces, we often use Gaussian Policy
θ is an n-vector (θ1, . . . , θn)
State features vector φ(s) = (φ1(s), . . . , φn(s)) for all s ∈ S
Gaussian Mean is a linear combination of state features θT · φ(s)
Variance may be fixed σ^2, or can also be parameterized
Policy is Gaussian, a ∼ N(θT · φ(s), σ^2) for all s ∈ S
The score function is:
∇θ log π(s, a; θ) = (a − θT · φ(s)) · φ(s) / σ^2
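The corresponding sketch for the Gaussian policy score; again, the parameter values are placeholders chosen only for illustration.

```python
import numpy as np

def gaussian_score(theta, phi_s, a, sigma):
    """Score for a ~ N(theta^T . phi(s), sigma^2): (a - theta^T phi(s)) * phi(s) / sigma^2."""
    return (a - theta @ phi_s) * phi_s / sigma ** 2

# sample an action from the Gaussian policy and compute its score (made-up values)
rng = np.random.default_rng(0)
theta, phi_s, sigma = np.array([0.3, -0.1]), np.array([1.0, 2.0]), 0.5
a = rng.normal(theta @ phi_s, sigma)
print(a, gaussian_score(theta, phi_s, a, sigma))
```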
13. Proof of Policy Gradient Theorem
We begin the proof by noting that:
J(θ) = ∫S p0(s0) · Vπ(s0) · ds0 = ∫S p0(s0) ∫A π(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
Calculate ∇θJ(θ) by parts π(s0, a0; θ) and Qπ(s0, a0)
∇θJ(θ) = ∫S p0(s0) ∫A ∇θπ(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
+ ∫S p0(s0) ∫A π(s0, a0; θ) · ∇θQπ(s0, a0) · da0 · ds0
14. Proof of Policy Gradient Theorem
Now expand Qπ(s0, a0) as R^{a0}_{s0} + ∫S γ · P^{a0}_{s0,s1} · Vπ(s1) · ds1 (Bellman)
= ∫S p0(s0) ∫A ∇θπ(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
+ ∫S p0(s0) ∫A π(s0, a0; θ) · ∇θ(R^{a0}_{s0} + ∫S γ · P^{a0}_{s0,s1} · Vπ(s1) · ds1) · da0 · ds0
Note: ∇θ R^{a0}_{s0} = 0, so remove that term
= ∫S p0(s0) ∫A ∇θπ(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
+ ∫S p0(s0) ∫A π(s0, a0; θ) · ∇θ(∫S γ · P^{a0}_{s0,s1} · Vπ(s1) · ds1) · da0 · ds0
15. Proof of Policy Gradient Theorem
Now bring the ∇θ inside the ∫S to apply only on Vπ(s1)
= ∫S p0(s0) ∫A ∇θπ(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
+ ∫S p0(s0) ∫A π(s0, a0; θ) ∫S γ · P^{a0}_{s0,s1} · ∇θVπ(s1) · ds1 · da0 · ds0
Now bring the outside ∫S and ∫A inside the inner ∫S
= ∫S p0(s0) ∫A ∇θπ(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
+ ∫S (∫S γ · p0(s0) ∫A π(s0, a0; θ) · P^{a0}_{s0,s1} · da0 · ds0) · ∇θVπ(s1) · ds1
16. Policy Gradient Theorem
Note that ∫A π(s0, a0; θ) · P^{a0}_{s0,s1} · da0 = p(s0 → s1, 1, π)
= ∫S p0(s0) ∫A ∇θπ(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
+ ∫S (∫S γ · p0(s0) · p(s0 → s1, 1, π) · ds0) · ∇θVπ(s1) · ds1
Now expand Vπ(s1) to ∫A π(s1, a1; θ) · Qπ(s1, a1) · da1
= ∫S p0(s0) ∫A ∇θπ(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
+ ∫S (∫S γ · p0(s0) · p(s0 → s1, 1, π) · ds0) · ∇θ(∫A π(s1, a1; θ) · Qπ(s1, a1) · da1) · ds1
17. Proof of Policy Gradient Theorem
We are now back to when we started calculating the gradient of ∫A π · Qπ · da.
Follow the same process of splitting π · Qπ, then Bellman-expanding Qπ (to calculate its gradient), and iterate.
= ∫S p0(s0) ∫A ∇θπ(s0, a0; θ) · Qπ(s0, a0) · da0 · ds0
+ ∫S (∫S γ · p0(s0) · p(s0 → s1, 1, π) · ds0) · (∫A ∇θπ(s1, a1; θ) · Qπ(s1, a1) · da1 + . . .) · ds1
This iterative process leads us to:
= Σ_{t=0}^∞ ∫S ∫S γ^t · p0(s0) · p(s0 → st, t, π) · ds0 ∫A ∇θπ(st, at; θ) · Qπ(st, at) · dat · dst
18. Proof of Policy Gradient Theorem
Bring Σ_{t=0}^∞ inside the two ∫S, and note that ∫A ∇θπ(st, at; θ) · Qπ(st, at) · dat is independent of t.
= ∫S ∫S Σ_{t=0}^∞ γ^t · p0(s0) · p(s0 → s, t, π) · ds0 ∫A ∇θπ(s, a; θ) · Qπ(s, a) · da · ds
Reminder that ∫S Σ_{t=0}^∞ γ^t · p0(s0) · p(s0 → s, t, π) · ds0 is (by definition) ρπ(s). So,
∇θJ(θ) = ∫S ρπ(s) ∫A ∇θπ(s, a; θ) · Qπ(s, a) · da · ds
Q.E.D.
19. Monte-Carlo Policy Gradient (REINFORCE Algorithm)
Update θ by stochastic gradient ascent using PGT
Using Gt = Σ_{k=t}^T γ^{k−t} · rk as an unbiased sample of Qπ(st, at):
∆θt = α · γ^t · ∇θ log π(st, at; θ) · Gt
Algorithm 4.1: REINFORCE(·)
Initialize θ arbitrarily
for each episode {s0, a0, r0, . . . , sT, aT, rT} ∼ π(·, ·; θ) do
    for t ← 0 to T do
        G ← Σ_{k=t}^T γ^{k−t} · rk
        θ ← θ + α · γ^t · ∇θ log π(st, at; θ) · G
return (θ)
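A compact Python sketch of Algorithm 4.1. The episode representation and the score_fn callable are assumptions about how the policy and sampled data are stored; they are not prescribed by the slides.

```python
import numpy as np

def reinforce_update(theta, episode, score_fn, alpha, gamma):
    """One sweep of Algorithm 4.1 over a sampled episode.

    episode:  list of (s, a, r) tuples generated by following pi(., .; theta)
    score_fn: callable (s, a, theta) -> grad_theta log pi(s, a; theta)
    """
    rewards = np.array([r for (_, _, r) in episode], dtype=float)
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        # G = sum_{k=t}^{T} gamma^{k-t} r_k, an unbiased sample of Q^pi(s_t, a_t)
        G = float((gamma ** np.arange(T - t)) @ rewards[t:])
        theta = theta + alpha * (gamma ** t) * score_fn(s, a, theta) * G
    return theta
```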
20. Reducing Variance using a Critic
Monte Carlo Policy Gradient has high variance
We use a Critic Q(s, a; w) to estimate Qπ(s, a)
Actor-Critic algorithms maintain two sets of parameters:
Critic updates parameters w to approximate Q-function for policy π
Critic could use any of the algorithms we learnt earlier:
Monte Carlo policy evaluation
Temporal-Difference Learning
TD(λ) based on Eligibility Traces
Could even use LSTD (if critic function approximation is linear)
Actor updates policy parameters θ in direction suggested by Critic
This is Approximate Policy Gradient due to Bias of Critic
∇θJ(θ) ≈ ∫S ρπ(s) ∫A ∇θπ(s, a; θ) · Q(s, a; w) · da · ds
21. So what does the algorithm look like?
Generate a sufficient set of simulation paths s0, a0, r0, s1, a1, r1, . . .
s0 is sampled from the distribution p0(·)
at is sampled from π(st, ·; θ)
st+1 sampled from transition probs and rt+1 from reward func
At each time step t, update w proportional to gradient of appropriate
(MC or TD-based) loss function of Q(s, a; w)
Sum γ^t · ∇θ log π(st, at; θ) · Q(st, at; w) over t and over paths
Update θ using this (biased) estimate of ∇θJ(θ)
Iterate with a new set of simulation paths ...
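A sketch of this batch actor-critic loop. The path representation, the critic callables (q_fn, q_grad_fn), and the TD(0)-style critic step are assumptions (terminal-state handling is omitted for brevity); the critic could equally use any of the evaluation methods listed on the previous slide.

```python
import numpy as np

def actor_critic_batch_update(theta, w, paths, score_fn, q_fn, q_grad_fn,
                              alpha_theta, alpha_w, gamma):
    """One sweep of the loop above (all interfaces are assumed).

    paths:     list of episodes, each a list of (s, a, r, s_next, a_next) tuples
    q_fn:      (s, a, w) -> Q(s, a; w);   q_grad_fn: (s, a, w) -> grad_w Q(s, a; w)
    """
    pg = np.zeros_like(theta)
    for path in paths:
        for t, (s, a, r, s_next, a_next) in enumerate(path):
            # Critic step: move w along the gradient of a TD(0)-style loss
            delta = r + gamma * q_fn(s_next, a_next, w) - q_fn(s, a, w)
            w = w + alpha_w * delta * q_grad_fn(s, a, w)
            # Actor accumulation: gamma^t * score * Q(s, a; w), summed over t and paths
            pg = pg + (gamma ** t) * score_fn(s, a, theta) * q_fn(s, a, w)
    theta = theta + alpha_theta * pg / len(paths)
    return theta, w
```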
22. Reducing Variance with a Baseline
We can reduce variance by subtracting a baseline function B(s) from
Q(s, a; w) in the Policy Gradient estimate
This means at each time step, we replace γ^t · ∇θ log π(st, at; θ) · Q(st, at; w) with
γ^t · ∇θ log π(st, at; θ) · (Q(st, at; w) − B(st))
Note that Baseline function B(s) is only a function of s (and not a)
This ensures that subtracting Baseline B(s) does not add bias
∫S ρπ(s) ∫A ∇θπ(s, a; θ) · B(s) · da · ds = ∫S ρπ(s) · B(s) · ∇θ(∫A π(s, a; θ) · da) · ds = 0
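A quick numerical check of this zero-bias property, using a hypothetical softmax policy and made-up features: since Σa π(s, a; θ) = 1 for every θ, the gradient of that sum (and hence the baseline term) is zero.

```python
import numpy as np

def softmax_probs(theta, phi_s):
    prefs = phi_s @ theta
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

theta = np.array([0.2, -0.7])
phi_s = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # made-up features
B, eps = 3.5, 1e-6
grad_of_total_prob = np.zeros_like(theta)
for i in range(len(theta)):
    bump = np.zeros_like(theta)
    bump[i] = eps
    # d/d theta_i of sum_a pi(s, a; theta): ~0 since the probabilities sum to 1
    grad_of_total_prob[i] = (softmax_probs(theta + bump, phi_s).sum()
                             - softmax_probs(theta - bump, phi_s).sum()) / (2 * eps)
print(B * grad_of_total_prob)   # ~ [0, 0]: the baseline adds no bias
```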
23. Using State Value function as Baseline
A good baseline B(s) is state value function V (s; v)
Rewrite Policy Gradient algorithm using advantage function estimate
A(s, a; w, v) = Q(s, a; w) − V (s; v)
Now the estimate of ∇θJ(θ) is given by:
∫S ρπ(s) ∫A ∇θπ(s, a; θ) · A(s, a; w, v) · da · ds
At each time step, we update both sets of parameters w and v
24. TD Error as estimate of Advantage Function
Consider TD error δπ for the true Value Function V π(s)
δπ = r + γ · Vπ(s') − Vπ(s)
δπ is an unbiased estimate of Advantage function Aπ(s, a)
Eπ[δπ | s, a] = Eπ[r + γ · Vπ(s') | s, a] − Vπ(s) = Qπ(s, a) − Vπ(s) = Aπ(s, a)
So we can write Policy Gradient in terms of Eπ[δπ | s, a]
∇θJ(θ) = ∫S ρπ(s) ∫A ∇θπ(s, a; θ) · Eπ[δπ | s, a] · da · ds
In practice, we can use func approx for TD error (and sample):
δ(s, r, s'; v) = r + γ · V(s'; v) − V(s; v)
This approach requires only one set of critic parameters v
25. TD Error can be used by both Actor and Critic
Algorithm 4.2: ACTOR-CRITIC-TD-ERROR(·)
Initialize Policy params θ ∈ R^m and State VF params v ∈ R^n arbitrarily
for each episode do
    Initialize s (first state of episode)
    P ← 1
    while s is not terminal do
        a ∼ π(s, ·; θ)
        Take action a, observe r, s'
        δ ← r + γ · V(s'; v) − V(s; v)
        v ← v + αv · δ · ∇v V(s; v)
        θ ← θ + αθ · P · δ · ∇θ log π(s, a; θ)
        P ← γ · P
        s ← s'
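A Python sketch of Algorithm 4.2 against an assumed gym-like environment interface (env.reset() returning a state, env.step(a) returning (s', r, done)); the value and score callables are placeholders for whatever function approximators are used.

```python
def actor_critic_td(env, theta, v, sample_action, score_fn, value_fn, value_grad_fn,
                    alpha_theta, alpha_v, gamma, num_episodes):
    """Sketch of Algorithm 4.2; env.reset()/env.step(a) -> (s', r, done) are assumed."""
    for _ in range(num_episodes):
        s, done, P = env.reset(), False, 1.0
        while not done:
            a = sample_action(s, theta)                    # a ~ pi(s, .; theta)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * value_fn(s_next, v)
            delta = target - value_fn(s, v)                # TD error
            v = v + alpha_v * delta * value_grad_fn(s, v)
            theta = theta + alpha_theta * P * delta * score_fn(s, a, theta)
            P, s = gamma * P, s_next
    return theta, v
```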
26. Using Eligibility Traces for both Actor and Critic
Algorithm 4.3: ACTOR-CRITIC-ELIGIBILITY-TRACES(·)
Initialize Policy params θ ∈ R^m and State VF params v ∈ R^n arbitrarily
for each episode do
    Initialize s (first state of episode)
    zθ, zv ← 0 (m and n components eligibility trace vectors)
    P ← 1
    while s is not terminal do
        a ∼ π(s, ·; θ)
        Take action a, observe r, s'
        δ ← r + γ · V(s'; v) − V(s; v)
        zv ← γ · λv · zv + ∇v V(s; v)
        zθ ← γ · λθ · zθ + P · ∇θ log π(s, a; θ)
        v ← v + αv · δ · zv
        θ ← θ + αθ · δ · zθ
        P ← γ · P, s ← s'
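The same loop extended with eligibility traces, mirroring Algorithm 4.3; the interface assumptions are the same as in the previous sketch.

```python
import numpy as np

def actor_critic_traces(env, theta, v, sample_action, score_fn, value_fn, value_grad_fn,
                        alpha_theta, alpha_v, gamma, lam_theta, lam_v, num_episodes):
    """Sketch of Algorithm 4.3 with eligibility traces z_theta, z_v (assumed interfaces)."""
    for _ in range(num_episodes):
        s, done, P = env.reset(), False, 1.0
        z_theta, z_v = np.zeros_like(theta), np.zeros_like(v)
        while not done:
            a = sample_action(s, theta)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * value_fn(s_next, v)
            delta = target - value_fn(s, v)
            z_v = gamma * lam_v * z_v + value_grad_fn(s, v)                    # critic trace
            z_theta = gamma * lam_theta * z_theta + P * score_fn(s, a, theta)  # actor trace
            v = v + alpha_v * delta * z_v
            theta = theta + alpha_theta * delta * z_theta
            P, s = gamma * P, s_next
    return theta, v
```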
27. Overcoming Bias
We’ve learnt a few ways of how to reduce variance
But we haven’t discussed how to overcome bias
All of the following substitutes for Qπ(s, a) in PG have bias:
Q(s, a; w)
A(s, a; w, v)
δ(s, s', r; v)
Turns out there is indeed a way to overcome bias
It is called the Compatible Function Approximation Theorem
28. Compatible Function Approximation Theorem
Theorem
If the following two conditions are satisfied:
1 Critic gradient is compatible with the Actor score function
∇w Q(s, a; w) = ∇θ log π(s, a; θ)
2 Critic parameters w minimize the following mean-squared error:
ε = ∫S ρπ(s) ∫A π(s, a; θ) · (Qπ(s, a) − Q(s, a; w))^2 · da · ds
Then the Policy Gradient using critic Q(s, a; w) is exact:
∇θJ(θ) = ∫S ρπ(s) ∫A ∇θπ(s, a; θ) · Q(s, a; w) · da · ds
29. Proof of Compatible Function Approximation Theorem
For w that minimizes
ε = ∫S ρπ(s) ∫A π(s, a; θ) · (Qπ(s, a) − Q(s, a; w))^2 · da · ds,
setting the gradient of ε w.r.t. w to zero gives:
∫S ρπ(s) ∫A π(s, a; θ) · (Qπ(s, a) − Q(s, a; w)) · ∇w Q(s, a; w) · da · ds = 0
But since ∇w Q(s, a; w) = ∇θ log π(s, a; θ), we have:
∫S ρπ(s) ∫A π(s, a; θ) · (Qπ(s, a) − Q(s, a; w)) · ∇θ log π(s, a; θ) · da · ds = 0
Therefore,
∫S ρπ(s) ∫A π(s, a; θ) · Qπ(s, a) · ∇θ log π(s, a; θ) · da · ds
= ∫S ρπ(s) ∫A π(s, a; θ) · Q(s, a; w) · ∇θ log π(s, a; θ) · da · ds
30. Proof of Compatible Function Approximation Theorem
But ∇θJ(θ) = ∫S ρπ(s) ∫A π(s, a; θ) · Qπ(s, a) · ∇θ log π(s, a; θ) · da · ds
So, ∇θJ(θ) = ∫S ρπ(s) ∫A π(s, a; θ) · Q(s, a; w) · ∇θ log π(s, a; θ) · da · ds
= ∫S ρπ(s) ∫A ∇θπ(s, a; θ) · Q(s, a; w) · da · ds
Q.E.D.
This means with conditions (1) and (2) of Compatible Function Approximation Theorem, we can use the critic func approx Q(s, a; w) and still have the exact Policy Gradient.
31. How to enable Compatible Function Approximation
A simple way to enable Compatible Function Approximation ∂Q(s, a; w)/∂wi = ∂ log π(s, a; θ)/∂θi, ∀i is to set Q(s, a; w) to be linear in its features.
Q(s, a; w) = Σ_{i=1}^n φi(s, a) · wi = Σ_{i=1}^n (∂ log π(s, a; θ)/∂θi) · wi
We note below that a compatible Q(s, a; w) serves as an approximation of the advantage function.
∫A π(s, a; θ) · Q(s, a; w) · da = ∫A π(s, a; θ) · (Σ_{i=1}^n (∂ log π(s, a; θ)/∂θi) · wi) · da
= ∫A (Σ_{i=1}^n (∂π(s, a; θ)/∂θi) · wi) · da = Σ_{i=1}^n (∫A ∂π(s, a; θ)/∂θi · da) · wi
= Σ_{i=1}^n ∂/∂θi (∫A π(s, a; θ) · da) · wi = Σ_{i=1}^n (∂1/∂θi) · wi = 0
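A small numerical check of this advantage-like property for a compatible linear critic built on the softmax score; the features, θ, and w below are made-up values for illustration.

```python
import numpy as np

def softmax_probs(theta, phi_s):
    p = np.exp(phi_s @ theta - (phi_s @ theta).max())
    return p / p.sum()

def softmax_score(theta, phi_s, a):
    return phi_s[a] - softmax_probs(theta, phi_s) @ phi_s

theta = np.array([0.5, -0.3])
phi_s = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # made-up features
w = np.array([0.7, 0.2])
probs = softmax_probs(theta, phi_s)
# Compatible critic: Q(s, a; w) = score(s, a; theta)^T . w
q_vals = np.array([softmax_score(theta, phi_s, a) @ w for a in range(len(phi_s))])
print(q_vals)
print(probs @ q_vals)   # ~ 0: zero mean under pi, like an advantage function
```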
32. Fisher Information Matrix
Denoting the score column vector as SC(s, a; θ) = [∂ log π(s, a; θ)/∂θi], i = 1, . . . , n,
and assuming compatible linear-approx critic:
∇θJ(θ) = ∫S ρπ(s) ∫A π(s, a; θ) · (SC(s, a; θ) · SC(s, a; θ)^T · w) · da · ds
= E_{s∼ρπ, a∼π}[SC(s, a; θ) · SC(s, a; θ)^T] · w
= FIM_{ρπ,π}(θ) · w
where FIM_{ρπ,π}(θ) is the Fisher Information Matrix w.r.t. s ∼ ρπ, a ∼ π.
33. Natural Policy Gradient
Recall the idea of Natural Gradient from Numerical Optimization
Natural gradient ∇θ^nat J(θ) is the direction of optimal θ movement
In terms of the KL-divergence metric (versus plain Euclidean norm)
Natural gradient yields better convergence (we won't cover proof)
Formally defined as: ∇θJ(θ) = FIM_{ρπ,π}(θ) · ∇θ^nat J(θ)
Therefore, ∇θ^nat J(θ) = w
This compact result is great for our algorithm:
Update Critic params w with the critic loss gradient (at step t) as:
γ^t · (rt + γ · SC(st+1, at+1; θ)^T · w − SC(st, at; θ)^T · w) · SC(st, at; θ)
Update Actor params θ in the direction equal to value of w
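A generic sketch of computing the natural gradient when a compatible critic is not used: estimate the FIM as the average outer product of sampled scores and solve FIM · g_nat = ∇θJ(θ). The ridge term, sampling interface, and toy numbers are assumptions.

```python
import numpy as np

def natural_gradient(score_samples, vanilla_grad, reg=1e-6):
    """Solve FIM . g_nat = grad_theta J(theta) for g_nat.

    score_samples: (num_samples, n) array of grad_theta log pi(s, a; theta),
                   with s ~ rho^pi and a ~ pi (sampling assumed done elsewhere)
    """
    fim = score_samples.T @ score_samples / len(score_samples)  # E[SC . SC^T]
    fim = fim + reg * np.eye(fim.shape[0])                      # small ridge for invertibility
    return np.linalg.solve(fim, vanilla_grad)

# With the compatible linear critic above, g_nat is simply w; this routine is the
# generic fallback when such a critic is not used.
rng = np.random.default_rng(1)
scores = rng.normal(size=(1000, 2))          # made-up sampled scores
grad_j = np.array([0.4, -0.1])
print(natural_gradient(scores, grad_j))
```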