Stochastic computation graphs provide a framework for automatically deriving unbiased gradient estimators. They generalize backpropagation to deal with random variables by treating the computation graph as a DAG with both deterministic and stochastic nodes. This allows gradients to be computed through expectations, enabling techniques like policy gradients for reinforcement learning and variational inference. The document describes several policy gradient methods that use stochastic computation graphs to compute gradients, including SVG(0), SVG(1), and DDPG. These methods have been successfully applied to robotics tasks and driving.
3. Motivation
Gradient estimation is at the core of much of modern machine learning
Backprop is mechanical: this gives us more freedom to tinker with architectures and loss functions
RL (and certain probabilistic models) involve components you can't backpropagate through; these need REINFORCE-style estimators of the form logprob * advantage
Often we need a combination of backprop-style and REINFORCE-style gradient estimation, e.g.:
RNN policies $\pi(a_t \mid h_t)$, where $h_t = f(h_{t-1}, a_{t-1}, o_t)$
Instead of a reward, we have a known differentiable function, e.g. a classifier with hard attention; objective = logprob of the correct label
Each paper that proposes such a model provides a new convoluted derivation
4. Gradients of Expectations
Want to compute $\nabla_\theta \mathbb{E}[F]$. Where's $\theta$?
"Inside" the distribution, e.g., $\mathbb{E}_{x \sim p(\cdot \mid \theta)}[F(x)]$:
$$\nabla_\theta \mathbb{E}_x[f(x)] = \mathbb{E}_x[f(x)\, \nabla_\theta \log p_x(x; \theta)]$$
Score function (SF) estimator
Example: REINFORCE policy gradients, where x is the trajectory
"Inside" the expectation, e.g., $\mathbb{E}_{z \sim N(0,1)}[F(\theta, z)]$:
$$\nabla_\theta \mathbb{E}_z[f(x(z, \theta))] = \mathbb{E}_z[\nabla_\theta f(x(z, \theta))]$$
Pathwise derivative (PD) estimator
Example: neural net with dropout
Often we can reparametrize to change from one form to the other. This is not a property of the problem: these are equivalent ways of writing down the same stochastic process.
What if F depends on θ in a complicated way, affecting both the distribution and F itself?
M. C. Fu. "Gradient estimation". In: Handbooks in Operations Research and Management Science 13 (2006), pp. 575–616
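To make the two estimators concrete, here is a minimal numpy sketch (the toy objective and all names are assumptions, not from the talk) that estimates $\frac{d}{d\theta}\mathbb{E}_{x \sim N(\theta,1)}[x^2] = 2\theta$ both ways:

```python
# Toy comparison of the SF and PD estimators on a problem whose gradient
# is known in closed form: d/dtheta E_{x~N(theta,1)}[x^2] = 2*theta.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000

# Score function (SF) estimator: E_x[f(x) * d/dtheta log p(x; theta)].
# For N(theta, 1), d/dtheta log p(x; theta) = (x - theta).
x = rng.normal(theta, 1.0, size=n)
sf = np.mean(x**2 * (x - theta))

# Pathwise derivative (PD) estimator: reparametrize x = theta + z with
# z ~ N(0,1), then estimate E_z[d/dtheta (theta + z)^2] = E_z[2(theta + z)].
z = rng.normal(0.0, 1.0, size=n)
pd = np.mean(2.0 * (theta + z))

print(f"true grad = {2 * theta}, SF ≈ {sf:.3f}, PD ≈ {pd:.3f}")
# Both are unbiased; the PD estimate typically has much lower variance here,
# which is why we reparametrize when f is differentiable.
```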
5. Stochastic Computation Graphs
A stochastic computation graph is a DAG in which each node corresponds to a deterministic or stochastic operation
Can automatically derive unbiased gradient estimators, with variance reduction
John Schulman (University of California, Berkeley), Nicolas Heess, Théophane Weber (Google DeepMind)
Generalize backpropagation to deal with random variables that we can't differentiate through.
Why can't we differentiate through them?
1) discrete random variables
2) unmodeled external world, in RL / control
[Figure: a deterministic computation graph beside a stochastic computation graph, with the stochastic node highlighted]
Motivation / Applications: policy gradients in reinforcement learning; variational inference
The sampling distribution is different from the distribution we are evaluating: for parameter $\theta \in \Theta$, $\theta = \theta_{\mathrm{old}}$ is used for sampling, but we are evaluating at $\theta = \theta_{\mathrm{new}}$.
$$\mathbb{E}_{v \prec c \mid \theta_{\mathrm{new}}}[\hat{c}] = \mathbb{E}_{v \prec c \mid \theta_{\mathrm{old}}}\left[\hat{c} \prod_{v \prec c,\ \theta \prec_D v} \frac{P_v(v \mid \mathrm{DEPS}_v \setminus \theta, \theta_{\mathrm{new}})}{P_v(v \mid \mathrm{DEPS}_v \setminus \theta, \theta_{\mathrm{old}})}\right] \quad (17)$$
$$\geq \mathbb{E}_{v \prec c \mid \theta_{\mathrm{old}}}\left[\hat{c}\left(\log\left(\prod_{v \prec c,\ \theta \prec_D v} \frac{P_v(v \mid \mathrm{DEPS}_v \setminus \theta, \theta_{\mathrm{new}})}{P_v(v \mid \mathrm{DEPS}_v \setminus \theta, \theta_{\mathrm{old}})}\right) + 1\right)\right] \quad (18)$$
where the second line used the inequality $x \geq \log x + 1$, and the sign is reversed since $\hat{c}$ is negative. Summing over $c \in C$ and rearranging we get
$$\mathbb{E}_{S \mid \theta_{\mathrm{new}}}\left[\sum_{c \in C} \hat{c}\right] \geq \mathbb{E}_{S \mid \theta_{\mathrm{old}}}\left[\sum_{c \in C} \hat{c} + \sum_{v \in S} \log\left(\frac{p(v \mid \mathrm{DEPS}_v \setminus \theta, \theta_{\mathrm{new}})}{p(v \mid \mathrm{DEPS}_v \setminus \theta, \theta_{\mathrm{old}})}\right) \hat{Q}_v\right] \quad (19)$$
$$= \mathbb{E}_{S \mid \theta_{\mathrm{old}}}\left[\sum_{v \in S} \log p(v \mid \mathrm{DEPS}_v \setminus \theta, \theta_{\mathrm{new}})\, \hat{Q}_v\right] + \mathrm{const} \quad (20)$$
Equation (20) allows for majorization-minimization algorithms (like the EM algorithm) to be used to optimize with respect to $\theta$. In fact, similar equations have been derived by interpreting rewards (negative costs) as probabilities, and then taking the variational lower bound on log-probability (e.g., [24]).
C Examples
C.1 Generalized EM Algorithm and Variational Inference.
The generalized EM algorithm maximizes likelihood in a probabilistic model with latent variables [18]. Suppose the probabilistic model defines a probability distribution $p(x, z; \theta)$ where $x$ is observed, $z$ is a latent variable, and $\theta$ is a parameter of the distribution. The generalized EM algorithm
It's all about gradients of expectations!
Just Differentiate:
$$\frac{\partial}{\partial \theta}\, \mathbb{E}\left[\sum_{\text{cost node } c} c\right] = \mathbb{E}\left[\sum_{\substack{\text{stochastic node } x \\ \text{deterministically influenced by } \theta}} \frac{\partial \log p(x \mid \mathrm{DEPS}_x)}{\partial \theta} \left(\hat{Q}_x - \mathrm{baseline}(x)\right) + \sum_{\substack{\text{cost node } c \\ \text{deterministically influenced by } \theta}} \frac{\partial \hat{c}}{\partial \theta}\right]$$
$$= \mathbb{E}\left[\frac{\partial}{\partial \theta}\left(\sum_{\substack{\text{stochastic node } x \\ \text{deterministically influenced by } \theta}} \log p(x \mid \mathrm{DEPS}_x)\, \hat{Q}_x + \sum_{\text{cost node } c} \hat{c}\right)\right]$$
(with $\hat{Q}_x$ the sum of costs downstream of $x$, treated as a constant)
Under certain conditions, this surrogate is related to a variational lower bound on the expected cost. Equivalently, use backprop on the surrogate loss.
Overview
Why's it useful?
1) no need to rederive every time
2) enable generic software
[Figure: overview poster residue, repeating the score function and pathwise derivative estimators from slide 4 and expanding the gradient estimator for the abstract example graph (parameters θ, φ; nodes a, b, c, d, e) into its individual terms, e.g. $\frac{\partial a}{\partial \theta} \frac{\partial \log p(b \mid a, d)}{\partial a}\, \hat{c}$, $\frac{\partial \log p(d \mid \phi)}{\partial \phi}\, (\hat{c} + \hat{e})$, and pathwise terms through $\frac{\partial c}{\partial b}$, $\frac{\partial b}{\partial d}$, and $\frac{\partial e}{\partial d}$]
J. Schulman, N. Heess, T. Weber, et al. “Gradient Estimation Using Stochastic Computation Graphs”. In: NIPS. 2015
6. Example: Hard Attention
Look at image (o1)
Determine crop window (a)
Get cropped image (o2)
Version 1: Output label (y); get reward of 1 for the correct label
Version 2: Output a parameter specifying a distribution over labels (κ); reward = prob or logprob of the correct label (log p)
[Diagrams: two graphs with parameter θ; Version 1: o1 → a → o2 → y → r; Version 2: o1 → a → o2 → κ → log p]
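A minimal PyTorch sketch of Version 2 (the tiny networks, the fake "crop" operation, and all dimensions are illustrative assumptions): the discrete crop choice gets a REINFORCE-style logprob term, while the classifier head is trained by ordinary backprop on the same objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_crops, n_classes, feat, batch = 4, 10, 16, 32
attend = nn.Linear(feat, n_crops)      # scores candidate crop windows from o1
classify = nn.Linear(feat, n_classes)  # labels the cropped image o2
opt = torch.optim.Adam([*attend.parameters(), *classify.parameters()], lr=1e-3)

o1 = torch.randn(batch, feat)                # stand-in features of the full image
labels = torch.randint(n_classes, (batch,))  # stand-in ground-truth labels

probs = F.softmax(attend(o1), dim=-1)
a = torch.multinomial(probs, 1).squeeze(-1)  # sample crop window: discrete stochastic node
o2 = o1 * (1.0 + a.float().unsqueeze(-1) / n_crops)  # stand-in for "crop with window a"
log_p = -F.cross_entropy(classify(o2), labels, reduction="none")  # logprob of correct label

# Surrogate: differentiable cost + logprob(crop choice) * (downstream cost as constant).
surrogate = log_p + torch.log(probs[torch.arange(batch), a]) * log_p.detach()
opt.zero_grad()
(-surrogate.mean()).backward()  # we maximize logprob, so minimize its negative
opt.step()
```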
7. Abstract Example
[Diagram: parameters θ and φ; θ feeds deterministic node a; φ feeds stochastic node d; b is sampled given (a, d); c and e are cost nodes]
L = c + e. Want to compute $\frac{d}{d\theta}\mathbb{E}[L]$ and $\frac{d}{d\phi}\mathbb{E}[L]$.
Treat the sampled values at stochastic nodes (such as $\hat{b}$) as constants, and introduce a loss logprob * (future cost − optional baseline) at each stochastic node
Obtain an unbiased gradient estimate by differentiating the surrogate:
$$\mathrm{Surrogate}(\theta, \phi) = \underbrace{c + e}_{(1)} + \underbrace{\log p(\hat{b} \mid a, d)\, \hat{c}}_{(2)}$$
(1): how parameters influence cost through deterministic dependencies
(2): how parameters affect the distribution over random variables
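A minimal PyTorch sketch of this recipe on one concrete instantiation of the example graph (the specific functions, distributions, and sample size are my own assumptions; the analogous score-function loss for the stochastic node d is included alongside the displayed term for b):

```python
import torch

theta = torch.tensor(0.5, requires_grad=True)
phi = torch.tensor(-0.3, requires_grad=True)
n = 100_000  # batch of samples to average the estimator over

a = torch.tanh(theta)                        # deterministic node a(theta)
d_dist = torch.distributions.Normal(phi, 1.0)
d = d_dist.sample((n,))                      # stochastic node d ~ p(d | phi)
b_dist = torch.distributions.Normal(a + d, 1.0)
b = b_dist.sample()                          # stochastic node b ~ p(b | a, d)
c, e = b**2, d**2                            # cost nodes

# Score-function losses: logprob * (downstream costs, treated as constants).
surrogate = (c + e
             + b_dist.log_prob(b) * c.detach()         # b influences only c
             + d_dist.log_prob(d) * (c + e).detach())  # d influences c and e
surrogate.mean().backward()
print("grad wrt theta:", theta.grad.item(), " grad wrt phi:", phi.grad.item())
```

Since b and d are drawn with `.sample()` (no reparameterization), the only gradient paths back to θ and φ are the two logprob terms; the (c + e) term contributes no gradient here because no cost depends on the parameters through a purely deterministic path.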
8. Results
General method for calculating the gradient estimator:
Surrogate loss version: each stochastic node gives a loss expression logprob(node value) * (sum of "downstream" costs, treated as constants). Add these to the sum of costs, treated as variables.
Alternative, generalized backprop: do a reverse topological sort as usual. At stochastic nodes, instead of backpropagating using the Jacobian, backpropagate gradlogprob * (future costs).
Generalized version of baselines for variance reduction (without introducing bias): logprob(node) * (sum of downstream costs treated as constants, minus baseline(nondescendants of node))
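A quick numerical sanity check (the toy problem and all numbers are my own) that subtracting a baseline keeps the score-function estimator unbiased while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n_trials, n = 0.0, 1000, 100

grads_plain, grads_base = [], []
for _ in range(n_trials):
    x = rng.normal(theta, 1.0, size=n)
    f = (x - 1.0) ** 2
    score = x - theta             # d/dtheta log N(x; theta, 1)
    b = (theta - 1.0) ** 2 + 1.0  # baseline: here the known value of E[f]
    grads_plain.append(np.mean(f * score))
    grads_base.append(np.mean((f - b) * score))

# Both means match the true gradient 2*(theta - 1); the baselined
# estimator's std across trials is noticeably smaller.
print("true grad   :", 2 * (theta - 1.0))
print("no baseline : mean %.3f, std %.3f" % (np.mean(grads_plain), np.std(grads_plain)))
print("baseline    : mean %.3f, std %.3f" % (np.mean(grads_base), np.std(grads_base)))
```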
10. Deriving the Policy Gradient, Reparameterized
Episodic MDP:
[Diagram: θ → a1, . . . , aT; states s1 → · · · → sT; total reward RT]
Want to compute $\nabla_\theta \mathbb{E}[R_T]$. We'll use $\nabla_\theta \log \pi(a_t \mid s_t; \theta)$.
11. Deriving the Policy Gradient, Reparameterized
Episodic MDP:
[Diagram: θ → a1, . . . , aT; states s1 → · · · → sT; total reward RT]
Want to compute $\nabla_\theta \mathbb{E}[R_T]$. We'll use $\nabla_\theta \log \pi(a_t \mid s_t; \theta)$.
Reparameterize: $a_t = \pi(s_t, z_t; \theta)$, where $z_t$ is noise from a fixed distribution.
[Diagram: as above, with noise nodes z1, . . . , zT feeding the actions]
12. Deriving the Policy Gradient, Reparameterized
Episodic MDP:
[Diagram: θ → a1, . . . , aT; states s1 → · · · → sT; total reward RT]
Want to compute $\nabla_\theta \mathbb{E}[R_T]$. We'll use $\nabla_\theta \log \pi(a_t \mid s_t; \theta)$.
Reparameterize: $a_t = \pi(s_t, z_t; \theta)$, where $z_t$ is noise from a fixed distribution.
[Diagram: as above, with noise nodes z1, . . . , zT feeding the actions]
Only works if $P(s_2 \mid s_1, a_1)$ is known
13. Using a Q-function
[Diagram: reparameterized episodic MDP, as above]
$$\frac{d}{d\theta}\, \mathbb{E}[R_T] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{dR_T}{da_t} \frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{d}{da_t}\, \mathbb{E}[R_T \mid a_t]\, \frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{dQ(s_t, a_t)}{da_t} \frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{d}{d\theta}\, Q(s_t, \pi(s_t, z_t; \theta))\right]$$
14. SVG(0) Algorithm
Learn $Q_\phi$ to approximate $Q^{\pi,\gamma}$, and use it to compute gradient estimates.
N. Heess, G. Wayne, D. Silver, et al. "Learning continuous control policies by stochastic value gradients". In: NIPS. 2015
15. SVG(0) Algorithm
Learn $Q_\phi$ to approximate $Q^{\pi,\gamma}$, and use it to compute gradient estimates.
Pseudocode:
for iteration = 1, 2, . . . do
    Execute policy $\pi_\theta$ to collect T timesteps of data
    Update $\pi_\theta$ using $g \propto \nabla_\theta \sum_{t=1}^{T} Q(s_t, \pi(s_t, z_t; \theta))$
    Update $Q_\phi$ using $g \propto \nabla_\phi \sum_{t=1}^{T} (Q_\phi(s_t, a_t) - \hat{Q}_t)^2$, e.g. with TD(λ)
end for
N. Heess, G. Wayne, D. Silver, et al. "Learning continuous control policies by stochastic value gradients". In: NIPS. 2015
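A minimal PyTorch sketch of one iteration of these two updates (network sizes, the fixed Gaussian action noise, and the random stand-in batch are assumptions; a real implementation would collect data by rolling out πθ and compute TD(λ) targets):

```python
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 4, 2, 64
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
qfunc = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
q_opt = torch.optim.Adam(qfunc.parameters(), lr=1e-3)

# Stand-ins for data collected by executing the policy: states, the noise z_t
# used when acting, the actions actually taken, and return targets Q_hat.
s = torch.randn(batch, obs_dim)
z = torch.randn(batch, act_dim)
a_taken = torch.randn(batch, act_dim)
q_target = torch.randn(batch, 1)
sigma = 0.1  # fixed exploration noise scale (assumption)

# Critic update: g ∝ grad_phi sum_t (Q_phi(s_t, a_t) - Q_hat_t)^2
q_loss = ((qfunc(torch.cat([s, a_taken], dim=-1)) - q_target) ** 2).mean()
q_opt.zero_grad(); q_loss.backward(); q_opt.step()

# Actor update: g ∝ grad_theta sum_t Q(s_t, pi(s_t, z_t; theta)),
# with the action reparameterized so gradients flow through it.
a_pi = policy(s) + sigma * z                           # a_t = pi(s_t, z_t; theta)
pi_loss = -qfunc(torch.cat([s, a_pi], dim=-1)).mean()  # ascend Q => descend -Q
pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```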
16. SVG(1) Algorithm
[Diagram: reparameterized episodic MDP, as above]
Instead of learning Q, we learn:
a state-value function $V \approx V^{\pi,\gamma}$
a dynamics model $f$, approximating $s_{t+1} = f(s_t, a_t) + \zeta_t$
Given a transition $(s_t, a_t, s_{t+1})$, infer $\zeta_t = s_{t+1} - f(s_t, a_t)$
Then $Q(s_t, a_t) = \mathbb{E}[r_t + \gamma V(s_{t+1})] = \mathbb{E}[r_t + \gamma V(f(s_t, a_t) + \zeta_t)]$, with $a_t = \pi(s_t, \theta, \zeta_t)$
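A minimal PyTorch sketch of the resulting SVG(1) policy update (the models, shapes, and stand-in batch are assumptions): infer the dynamics noise from the observed transition, then backpropagate through $r_t + \gamma V(f(s_t, a_t) + \zeta_t)$ with the reparameterized action.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, batch = 4, 2, 0.99, 64
vfunc = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
dyn = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, obs_dim))
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-ins for one batch of observed transitions (s_t, a_t, r_t, s_{t+1}).
s, a_taken = torch.randn(batch, obs_dim), torch.randn(batch, act_dim)
r, s_next = torch.randn(batch, 1), torch.randn(batch, obs_dim)
action_noise = torch.randn(batch, act_dim)  # noise used when acting (assumption)
sigma = 0.1

# Infer the dynamics noise: zeta_t = s_{t+1} - f(s_t, a_t).
with torch.no_grad():
    zeta = s_next - dyn(torch.cat([s, a_taken], dim=-1))

# Recompute the action with the current policy and the same noise, then
# differentiate r_t + gamma * V(f(s_t, a_t) + zeta_t) w.r.t. theta.
a = policy(s) + sigma * action_noise
s_next_pred = dyn(torch.cat([s, a], dim=-1)) + zeta
pi_loss = -(r + gamma * vfunc(s_next_pred)).mean()
pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```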
17. SVG(∞) Algorithm
[Diagram: reparameterized episodic MDP, as above]
Just learn the dynamics model f
Given the whole trajectory, infer all noise variables
Freeze all policy and dynamics noise, differentiate through the entire deterministic computation graph
18. SVG Results
Applied to 2D robotics tasks
Overall: different gradient estimators behave similarly
N. Heess, G. Wayne, D. Silver, et al. “Learning continuous control policies by stochastic value gradients”. In: NIPS. 2015
19. Deterministic Policy Gradient
For Gaussian actions, the variance of the score function policy gradient estimator goes to infinity as the variance goes to zero
But the SVG(0) gradient is fine when σ → 0:
$$\nabla_\theta \sum_t Q(s_t, \pi(s_t, \theta, \zeta_t))$$
Problem: there's no exploration.
Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
The policy gradient is a little biased (even with Q = Q^π), but only because the state distribution is off; it gets the right gradient at every state
D. Silver, G. Lever, N. Heess, et al. "Deterministic policy gradient algorithms". In: ICML. 2014
20. Deep Deterministic Policy Gradient
Incorporate replay buffer and target network ideas from DQN for increased stability
T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al. "Continuous control with deep reinforcement learning". In: ICLR (2015)
21. Deep Deterministic Policy Gradient
Incorporate replay buffer and target network ideas from DQN for increased stability
Use lagged (Polyak-averaged) versions $Q_{\phi'}$ and $\pi_{\theta'}$ for fitting $Q_\phi$ (towards $Q^{\pi,\gamma}$) with TD(0):
$$\hat{Q}_t = r_t + \gamma\, Q_{\phi'}(s_{t+1}, \pi(s_{t+1}; \theta'))$$
T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al. "Continuous control with deep reinforcement learning". In: ICLR (2015)
22. Deep Deterministic Policy Gradient
Incorporate replay buffer and target network ideas from DQN for increased stability
Use lagged (Polyak-averaged) versions $Q_{\phi'}$ and $\pi_{\theta'}$ for fitting $Q_\phi$ (towards $Q^{\pi,\gamma}$) with TD(0):
$$\hat{Q}_t = r_t + \gamma\, Q_{\phi'}(s_{t+1}, \pi(s_{t+1}; \theta'))$$
Pseudocode:
for iteration = 1, 2, . . . do
    Act for several timesteps, add data to replay buffer
    Sample minibatch
    Update $\pi_\theta$ using $g \propto \nabla_\theta \sum_{t=1}^{T} Q(s_t, \pi(s_t, z_t; \theta))$
    Update $Q_\phi$ using $g \propto \nabla_\phi \sum_{t=1}^{T} (Q_\phi(s_t, a_t) - \hat{Q}_t)^2$
end for
T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al. "Continuous control with deep reinforcement learning". In: ICLR (2015)
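A minimal PyTorch sketch of one DDPG update (networks, hyperparameters, and the stand-in minibatch are assumptions): the critic regresses on the lagged-target estimate $\hat{Q}_t$, the actor follows the deterministic policy gradient, and both target networks are Polyak-averaged.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau, batch = 4, 2, 0.99, 0.005, 64
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
qfunc = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))
policy_targ, q_targ = copy.deepcopy(policy), copy.deepcopy(qfunc)  # lagged copies
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
q_opt = torch.optim.Adam(qfunc.parameters(), lr=1e-3)

# Stand-in for a minibatch sampled from the replay buffer.
s, a = torch.randn(batch, obs_dim), torch.randn(batch, act_dim)
r, s_next = torch.randn(batch, 1), torch.randn(batch, obs_dim)

# Critic: regress Q_phi(s, a) towards Q_hat = r + gamma * Q_phi'(s', pi(s'; theta')).
with torch.no_grad():
    q_hat = r + gamma * q_targ(torch.cat([s_next, policy_targ(s_next)], dim=-1))
q_loss = ((qfunc(torch.cat([s, a], dim=-1)) - q_hat) ** 2).mean()
q_opt.zero_grad(); q_loss.backward(); q_opt.step()

# Actor: deterministic policy gradient, ascend Q(s, pi(s; theta)).
pi_loss = -qfunc(torch.cat([s, policy(s)], dim=-1)).mean()
pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

# Polyak-average the target networks towards the live networks.
with torch.no_grad():
    for targ, live in [(policy_targ, policy), (q_targ, qfunc)]:
        for p_t, p in zip(targ.parameters(), live.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```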
23. DDPG Results
Applied to 2D and 3D robotics tasks and driving with pixel input
T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al. “Continuous control with deep reinforcement learning”. In: ICLR (2015)
24. Policy Gradient Methods: Comparison
Y. Duan, X. Chen, R. Houthooft, et al. “Benchmarking Deep Reinforcement Learning for Continuous Control”. In: ICML (2016)