A Stein Variational Framework for Deep Probabilistic Modeling
Qiang Liu
Dartmouth College (→ UT Austin)
Probabilistic Modeling for Machine Learning
Modern machine learning = Complex data + Complex models
Complex data $\{x_i\}$; complex models $p(x)$.
Unnormalized Distributions
In practice, many distributions have unnormalized densities:
$$p(x) = \frac{1}{Z}\,\bar p(x), \qquad Z = \int \bar p(x)\,dx.$$
$Z$: normalization constant, critically difficult to calculate!
Widely appear in
Bayesian inference,
Probabilistic graphical models,
Deep energy-based models,
Log-linear models,
and many more ...
Highly difficult to learn, sample from, and evaluate.
Scalable computational algorithms are the key.
Can benefit from integrating tools in different areas ...
This Talk
This talk focuses on the inference (sampling) problem:
Given $p$, find $\{x_i\}$ to approximate $p$.
Two applications:
Policy optimization in reinforcement learning.
Training neural networks to generate natural images.
Classical Methods for Inference (Sampling)
Sampling: Given $p$, find $\{x_i\}$ to approximate $p$.
Monte Carlo / Markov chain Monte Carlo (MCMC):
Simulate random points.
Asymptotically “correct”, but slow.
Variational inference:
Approximate $p$ with a simpler $q_\theta$ (e.g., Gaussian): $\min_{\theta\in\Theta} \mathrm{KL}(q_\theta \,\|\, p)$.
Needs a parametric assumption: fast, but “wrong”.
Optimization (maximum a posteriori, MAP):
Find a single-point approximation: $x^* = \arg\max_x p(x)$.
Fastest, but prone to local optima and gives no uncertainty assessment.
Stein Variational Gradient Descent (SVGD) [Liu & Wang, 2016]
Directly minimize the Kullback-Leibler (KL) divergence between $\{x_i\}$ and $p$:
$$\min_{\{x_i\}} \mathrm{KL}(q \,\|\, p)$$
where $q$ is the empirical distribution $q(x) = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)$.
An ill-posed problem? $\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{x\sim q}[\log(q/p)] = \infty$.
Turns out to be doable, with some new insights...
The KL divergence is infinite, but its “gradient” is not...
Stein Variational Gradient Descent (SVGD) [Liu & Wang, 2016]
Idea: Iteratively move $\{x_i\}_{i=1}^n$ towards the target $p$ by updates of the form
$$x_i \leftarrow x_i + \epsilon\,\phi(x_i),$$
$\epsilon$: step size. $\phi$: a perturbation direction chosen to maximally decrease the KL divergence with $p$:
$$\phi^* = \arg\max_{\phi\in\mathcal{F}} \Big\{\underbrace{\mathrm{KL}(q \,\|\, p)}_{\text{old particles}} - \underbrace{\mathrm{KL}(q_{[\epsilon\phi]} \,\|\, p)}_{\text{updated particles}}\Big\} \approx \arg\max_{\phi\in\mathcal{F}} \Big\{-\frac{\partial}{\partial\epsilon}\,\mathrm{KL}(q_{[\epsilon\phi]} \,\|\, p)\Big|_{\epsilon=0}\Big\}, \quad \text{// when the step size is small}$$
where $q_{[\epsilon\phi]}$ is the density of $x' = x + \epsilon\,\phi(x)$ when the density of $x$ is $q$.
Think of $q$ as the empirical distribution of $\{x_i\}$: $q(x) = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)$.
Stein Variational Gradient Descent (SVGD) [Liu & Wang, 2016]
Key: the objective is a simple, linear functional of $\phi$:
$$-\frac{\partial}{\partial\epsilon}\,\mathrm{KL}(q_{[\epsilon\phi]} \,\|\, p)\Big|_{\epsilon=0} = \mathbb{E}_{x\sim q}[\mathcal{T}_p\phi(x)],$$
where $\mathcal{T}_p$ is a linear operator, called the Stein operator, related to $p$:
$$\mathcal{T}_p\phi(x) \;\overset{\text{def}}{=}\; \langle \nabla_x \log p(x),\, \phi(x)\rangle + \langle \nabla_x,\, \phi(x)\rangle.$$
The score function $\nabla_x \log p(x) = \nabla_x p(x)/p(x)$ is independent of the normalization constant $Z$!
Stein's identity (if $p = q$):
$$\mathbb{E}_{x\sim p}[\mathcal{T}_p\phi(x)] = \mathbb{E}_{x\sim p}\big[\langle\nabla_x \log p(x),\,\phi(x)\rangle + \nabla_x\cdot\phi(x)\big] = \int \big(\nabla_x p(x)^\top \phi(x) + p(x)\,\nabla_x\cdot\phi(x)\big)\,dx = 0 \quad \text{(integration by parts)}.$$
Stein's method: a set of theoretical techniques for proving fundamental approximation bounds and limits (such as the central limit theorem) in probability theory. A large body of theoretical work, known to be “remarkably powerful”.
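The last two points are easy to verify numerically. Below is a minimal sketch (not from the slides), assuming a 1D standard Gaussian target $p(x) \propto e^{-x^2/2}$ and the illustrative test function $\phi(x) = \sin x$: the score needs no normalization constant, and the Monte Carlo average of $\mathcal{T}_p\phi$ over exact samples from $p$ is approximately zero, as Stein's identity predicts.

```python
import numpy as np

# Minimal numerical check (a sketch, not from the slides): the target is
# a 1D standard Gaussian p(x) ∝ exp(-x^2/2).

def score(x):
    # ∇x log p(x) = -x; the normalization constant Z cancels, so the
    # score never needs Z.
    return -x

def phi(x):
    # An arbitrary smooth, bounded test function (illustrative choice).
    return np.sin(x)

def stein_op(x):
    # 1D Stein operator: T_p φ(x) = ∇x log p(x) · φ(x) + φ'(x).
    return score(x) * phi(x) + np.cos(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # exact samples from p
print(stein_op(x).mean())            # ≈ 0, as Stein's identity predicts
```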
Stein Discrepancy
The optimization is equivalent to
$$\mathbb{D}(q \,\|\, p) \;\overset{\text{def}}{=}\; \max_{\phi\in\mathcal{F}} \mathbb{E}_q[\mathcal{T}_p\phi],$$
where $\mathbb{D}(q \,\|\, p)$ is called the Stein discrepancy: $\mathbb{D}(q \,\|\, p) = 0$ iff $q = p$, provided that $\mathcal{F}$ is “large” enough.
The choice of $\mathcal{F}$ is critical. The traditional Stein discrepancy is not computable: it casts challenging infinite-dimensional functional optimizations. Two remedies:
Imposing constraints only on a finite number of points [Gorham, Mackey 15; Gorham et al. 16].
Obtaining closed-form solutions using reproducing kernel Hilbert spaces [Liu et al. 16; Chwialkowski et al. 16; Oates et al. 14; Gorham, Mackey 17].
Kernel Stein Discrepancy [Liu et al. 16; Chwialkowski et al. 16]
Computable Stein discrepancy using kernels:
Take $\mathcal{F}$ to be the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with positive definite kernel $k(x, x')$:
$$\mathbb{D}(q \,\|\, p) \;\overset{\text{def}}{=}\; \max_{\phi\in\mathcal{H}}\; \mathbb{E}_q[\mathcal{T}_p\phi] \quad \text{s.t.} \quad \|\phi\|_{\mathcal{H}} \le 1.$$
Closed-form solution:
$$\phi^*(\cdot) \propto \mathbb{E}_{x\sim q}[\mathcal{T}_p k(x,\cdot)] = \mathbb{E}_{x\sim q}\big[\nabla_x \log p(x)\,k(x,\cdot) + \nabla_x k(x,\cdot)\big].$$
Kernel Stein discrepancy:
$$\mathbb{D}(q, p)^2 = \mathbb{E}_{x,x'\sim q}\big[\mathcal{T}_p^{x}\,\mathcal{T}_p^{x'} k(x, x')\big],$$
where $\mathcal{T}_p^{x}, \mathcal{T}_p^{x'}$ denote the Stein operator applied w.r.t. variable $x$ and $x'$, respectively.
Kernel Stein Discrepancy
Kernel Stein discrepancy provides a computational tool for comparing samples $\{x_i\}$ (from an unknown $q$) with an unnormalized model $p$:
$$\mathbb{D}(\{x_i\}, p)^2 \;\overset{\text{def}}{=}\; \frac{1}{n^2}\sum_{ij} \mathcal{T}_p^{x}\,\mathcal{T}_p^{x'} k(x_i, x_j).$$
Applications:
Goodness-of-fit tests for unnormalized distributions [Liu et al. 16; Chwialkowski et al. 16].
Black-box importance sampling [Liu, Lee 16]: importance weights for samples from unknown distributions, obtained by minimizing the Stein discrepancy, with super-efficient convergence rates.
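The double sum above is straightforward to implement. Here is a minimal 1D sketch under assumed choices (an RBF kernel with a hand-picked bandwidth h; the function name ksd_squared and the Gaussian example are illustrative, not code from the papers):

```python
import numpy as np

# V-statistic estimate of the kernel Stein discrepancy in 1D with an RBF
# kernel k(x,y) = exp(-(x-y)²/(2h²)); h is hand-picked for this sketch.

def ksd_squared(x, score, h=1.0):
    # x: (n,) samples from the unknown q;  score(x) = d/dx log p(x).
    d = x[:, None] - x[None, :]                 # pairwise x_i - x_j
    k = np.exp(-d**2 / (2 * h**2))
    dk_dx = -d / h**2 * k                       # ∂k/∂x_i
    dk_dy = d / h**2 * k                        # ∂k/∂x_j
    d2k = (1 / h**2 - d**2 / h**4) * k          # ∂²k/∂x_i∂x_j
    s = score(x)
    u = (s[:, None] * s[None, :] * k            # s(x) s(x') k(x,x')
         + s[:, None] * dk_dy                   # s(x) ∂_{x'} k
         + s[None, :] * dk_dx                   # s(x') ∂_x k
         + d2k)                                 # second-derivative term
    return u.mean()                             # ≈ D({x_i}, p)²

rng = np.random.default_rng(0)
score = lambda x: -x                            # p = N(0, 1); Z never needed
print(ksd_squared(rng.standard_normal(500), score))       # near 0: q ≈ p
print(ksd_squared(rng.standard_normal(500) + 2.0, score)) # larger: q ≠ p
```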
Stein Variational Gradient Descent
SVGD: approximate $\mathbb{E}_{x\sim q}[\cdot]$ with the empirical average $\hat{\mathbb{E}}_{x\sim\{x_i\}_{i=1}^n}[\cdot]$ over the current points:
$$x_i \leftarrow x_i + \epsilon\,\hat{\mathbb{E}}_{x\sim\{x_i\}_{i=1}^n}\big[\nabla_x \log p(x)\,k(x, x_i) + \nabla_x k(x, x_i)\big], \qquad \forall i = 1,\dots,n.$$
Iteratively move the particles $\{x_i\}$ to fit $p$.
Stein Variational Gradient Descent
SVGD: iteratively update $\{x_i\}$ until convergence:
$$x_i \leftarrow x_i + \epsilon\,\hat{\mathbb{E}}_{x\sim\{x_i\}_{i=1}^n}\big[\underbrace{\nabla_x \log p(x)\,k(x, x_i)}_{\text{weighted sum of gradients}} + \underbrace{\nabla_x k(x, x_i)}_{\text{repulsive force}}\big], \qquad \forall i = 1,\dots,n.$$
Two terms:
$\nabla_x \log p(x)$: moves the particles $\{x_i\}$ towards high-probability regions of $p(x)$; nearby particles share gradients through the weighted sum.
$\nabla_x k(x, x')$: enforces diversity in $\{x_i\}$ (otherwise all $x_i$ would collapse to the modes of $p(x)$).
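Putting the pieces together, here is a compact sketch of the whole SVGD loop in 1D, assuming an RBF kernel with the median-heuristic bandwidth (as in Liu & Wang 2016); the two-component Gaussian-mixture target, step size, and all names are illustrative choices:

```python
import numpy as np

# A minimal SVGD sketch in 1D; the target p is an equal-weight mixture
# 0.5·N(-2,1) + 0.5·N(2,1), specified only through its Z-free score.

def score(x):
    w1 = np.exp(-(x + 2)**2 / 2)
    w2 = np.exp(-(x - 2)**2 / 2)
    return (-(x + 2) * w1 - (x - 2) * w2) / (w1 + w2)

def svgd_update(x, score, eps=0.1):
    d = x[:, None] - x[None, :]                              # x_j - x_i
    h = np.median(np.abs(d))**2 / np.log(len(x) + 1) + 1e-8  # median trick
    k = np.exp(-d**2 / (2 * h))                              # k(x_j, x_i)
    grad_k = -d / h * k                                      # ∇_{x_j} k(x_j, x_i)
    # φ*(x_i) = Ê_j[ ∇log p(x_j)·k(x_j, x_i) + ∇_{x_j} k(x_j, x_i) ]
    phi = (score(x)[:, None] * k + grad_k).mean(axis=0)
    return x + eps * phi

rng = np.random.default_rng(0)
x = rng.normal(-10, 1, size=100)        # particles start far from p
for _ in range(1000):
    x = svgd_update(x, score)
print(x.mean(), x.std())                # should approach (0, √5 ≈ 2.24)
```

Note that with n = 1 the kernel terms drop out (k = 1, ∇k = 0) and the loop becomes plain gradient ascent on log p, matching the MAP reduction discussed next.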
Stein Variational Gradient Descent
Distribution p defined as a 2D Gaussian mixture on the pixels.
SVGD vs. MAP and Monte Carlo
$$x_i \leftarrow x_i + \epsilon\,\hat{\mathbb{E}}_{x\sim\{x_i\}_{i=1}^n}\big[\underbrace{\nabla_x \log p(x)}_{\text{gradient}}\,k(x, x_i) + \underbrace{\nabla_x k(x, x_i)}_{\text{repulsive force}}\big], \qquad \forall i = 1,\dots,n.$$
When using a single particle ($n = 1$), SVGD reduces to standard gradient ascent for $\max_x \log p(x)$ (i.e., maximum a posteriori, MAP):
$$x \leftarrow x + \epsilon\,\nabla_x \log p(x).$$
MAP (SVGD with $n = 1$) already performs well in many practical cases; typical Monte Carlo / MCMC methods perform worse when $n = 1$.
SVGD as Gradient Flow of KL Divergence [Liu 2017, arXiv:1704.07520]
The empirical measures of the particles weakly converge to the solution of a nonlinear Fokker-Planck equation, which is a gradient flow of the KL divergence:
$$\frac{\partial}{\partial t} q_t = -\nabla\cdot(\phi^*_{q_t,p}\, q_t) = -\mathrm{grad}\,\mathrm{KL}(q_t \,\|\, p),$$
which decreases the KL divergence monotonically:
$$\frac{d}{dt}\,\mathrm{KL}(q_t \,\|\, p) = -\mathbb{D}(q_t, p)^2.$$
Here $\mathrm{grad}\,\mathrm{KL}(q \,\|\, p)$ is a functional gradient defined w.r.t. a new notion of distance between distributions: the minimum cost of transporting the mass of $q$ to $p$. This gives a new geometric structure on the space of distributions.
Bayesian Logistic Regression
[Figure: (a) test accuracy vs. number of epochs with particle size n = 100; (b) test accuracy vs. particle size n at the 3000th iteration. Methods compared: Stein variational gradient descent (our method), stochastic Langevin dynamics (parallel and sequential SGLD), particle mirror descent (PMD), and doubly stochastic variational inference (DSVI).]
Bayesian Neural Network
Test Bayesian neural nets on benchmark datasets.
Used 20 particles.
Compared with probabilistic back propagation (PBP) [Hernandez-Lobato et al. 2015].

Dataset   | Avg. Test RMSE (PBP) | Avg. Test RMSE (Ours) | Avg. Test LL (PBP) | Avg. Test LL (Ours) | Time, s (PBP) | Time, s (Ours)
Boston    | 2.977 ± 0.093 | 2.957 ± 0.099 | −2.579 ± 0.052 | −2.504 ± 0.029 | 18   | 16
Concrete  | 5.506 ± 0.103 | 5.324 ± 0.104 | −3.137 ± 0.021 | −3.082 ± 0.018 | 33   | 24
Energy    | 1.734 ± 0.051 | 1.374 ± 0.045 | −1.981 ± 0.028 | −1.767 ± 0.024 | 25   | 21
Kin8nm    | 0.098 ± 0.001 | 0.090 ± 0.001 |  0.901 ± 0.010 |  0.984 ± 0.008 | 118  | 41
Naval     | 0.006 ± 0.000 | 0.004 ± 0.000 |  3.735 ± 0.004 |  4.089 ± 0.012 | 173  | 49
Combined  | 4.052 ± 0.031 | 4.033 ± 0.033 | −2.819 ± 0.008 | −2.815 ± 0.008 | 136  | 51
Protein   | 4.623 ± 0.009 | 4.606 ± 0.013 | −2.950 ± 0.002 | −2.947 ± 0.003 | 682  | 68
Wine      | 0.614 ± 0.008 | 0.609 ± 0.010 | −0.931 ± 0.014 | −0.925 ± 0.014 | 26   | 22
Yacht     | 0.778 ± 0.042 | 0.864 ± 0.052 | −1.211 ± 0.044 | −1.225 ± 0.042 | 25   | 25
Year      | 8.733 ± NA    | 8.684 ± NA    | −3.586 ± NA    | −3.580 ± NA    | 7777 | 684
SVGD as a Search Heuristic
Particles collaborate to explore a large search space.
Can be used to solve challenging non-convex optimization problems.
Application: policy optimization in deep reinforcement learning.
A Very Quick Intro to Reinforcement Learning
Agents take actions $a$ based on observed states $s$, and receive reward $r$.
Policy $\pi_\theta(a \mid s)$, parameterized by $\theta$.
Goal: find the optimal policy $\pi_\theta(a \mid s)$ to maximize the expected reward:
$$\max_\theta\; J(\theta) = \mathbb{E}[r(s, a) \mid \pi_\theta].$$
Can be viewed as a black-box optimization problem.
Model-Free Policy Gradient
Model-free policy gradient methods estimate the gradient (without knowing the transition and reward model) and perform gradient ascent:
$$\theta \leftarrow \theta + \epsilon\,\nabla_\theta J(\theta).$$
Different methods for gradient estimation:
Finite-difference methods.
Likelihood-ratio methods: REINFORCE, etc. (a sketch follows this list).
Actor-critic methods: Advantage Actor-Critic (A2C), etc.
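As a concrete instance of the likelihood-ratio family, here is a sketch of REINFORCE on a one-step bandit with Gaussian policy $\pi_\theta(a) = \mathcal{N}(a; \theta, 1)$; the reward function and all names are hypothetical, chosen only to exhibit the estimator $\hat{\mathbb{E}}[\nabla_\theta \log \pi_\theta(a)\, r(a)]$:

```python
import numpy as np

# REINFORCE on a one-step bandit: π_θ(a) = N(a; θ, 1), and the gradient
# is estimated as the sample mean of ∇_θ log π_θ(a)·r(a) = (a - θ)·r(a).

def reward(a):
    return -(a - 3.0)**2                       # hypothetical black-box reward

def reinforce_grad(theta, rng, n=256):
    a = rng.normal(theta, 1.0, size=n)         # sample actions from π_θ
    return ((a - theta) * reward(a)).mean()    # likelihood-ratio estimator

rng = np.random.default_rng(0)
theta = 0.0
for _ in range(2000):
    theta += 0.01 * reinforce_grad(theta, rng) # gradient ascent on J(θ)
print(theta)                                   # → near the optimal action, 3
```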
Model-Free Policy Gradient
Advantages:
Better convergence; works for high-dimensional, continuous control tasks.
Impressive results on Atari games, vision-based navigation, etc.
Challenges:
Convergence to local optima.
High variance in gradient estimation.
Stein Variational Policy Gradient [Liu et al. 17, arXiv:1704.02399]
Stein variational policy gradient (SVPG): find a group of policies $\{\theta_i\}$ by
$$\theta_i \leftarrow \theta_i + \frac{\epsilon}{n}\sum_{j=1}^n\Big[\underbrace{\nabla_{\theta_j} J(\theta_j)\,k(\theta_j, \theta_i)}_{\text{gradient sharing}} + \alpha\,\underbrace{\nabla_{\theta_j} k(\theta_j, \theta_i)}_{\text{repulsive force}}\Big].$$
Similar to collective behaviors in swarm intelligence.
Can be viewed as sampling $\{\theta_i\}$ from a Boltzmann distribution:
$$p(\theta) \propto \exp\Big(\frac{1}{\alpha}\,J(\theta)\Big) = \arg\max_q \Big\{\mathbb{E}_q[J(\theta)] + \alpha\,H(q)\Big\},$$
where the entropy term $H(q)$ acts as an entropy regularization that encourages exploration, and $\alpha$ is a temperature parameter. A sketch of the update follows below.
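Here is a sketch of the SVPG update above, assuming an estimator grad_J of $\nabla_\theta J$ is available (e.g., from REINFORCE or A2C rollouts); the RBF kernel, temperature, and the toy quadratic objective in the usage lines are illustrative assumptions:

```python
import numpy as np

# One SVPG step on a population of policy parameters θ_i (rows of thetas),
# following θ_i ← θ_i + (ε/n) Σ_j [∇J(θ_j) k(θ_j,θ_i) + α ∇_{θ_j} k(θ_j,θ_i)].

def svpg_step(thetas, grad_J, eps=1e-2, alpha=1.0):
    n = len(thetas)
    diff = thetas[:, None, :] - thetas[None, :, :]   # θ_j - θ_i, shape (n,n,d)
    sq = (diff**2).sum(-1)
    h = np.median(sq) / np.log(n + 1) + 1e-8         # median heuristic
    k = np.exp(-sq / (2 * h))                        # k(θ_j, θ_i)
    grad_k = -diff / h * k[:, :, None]               # ∇_{θ_j} k(θ_j, θ_i)
    g = np.stack([grad_J(t) for t in thetas])        # per-agent gradients
    phi = (k[:, :, None] * g[:, None, :] + alpha * grad_k).mean(axis=0)
    return thetas + eps * phi

# Toy usage: J(θ) = -||θ - 1||²/2, so exp(J/α) is a Gaussian around (1, 1).
grad_J = lambda th: -(th - 1.0)
thetas = np.random.default_rng(0).normal(size=(16, 2))   # 16 agents
for _ in range(500):
    thetas = svpg_step(thetas, grad_J)
print(thetas.mean(axis=0))       # agents cluster (with spread) near (1, 1)
```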
REINFORCE-SVPG: Stein variational policy gradient (n = 16 agents).
REINFORCE-Independent: n independent gradient-based agents.
REINFORCE-Joint: a single agent using n times as much data per iteration.
A2C-SVPG: Stein variational policy gradient (n = 16 agents).
A2C-Independent: n independent gradient-based agents.
A2C-Joint: a single agent using n times as much data per iteration.
Average returns of the policies given by SVGD (blue) and independent
A2C (red), for Cartpole Swing Up.
State visitation density of the top 4 policies given by SVGD (upper) and
independent REINFORCE (lower), for Cartpole Swing Up.
Swimmer
Top Four Policies by SVPG
Stein Variational Gradient Descent
SVGD: a simple, efficient algorithm for sampling and non-convex optimization.
Amortized SVGD: Learning to Sample
SVGD is designed for sampling from individual distributions. What if we need to solve many similar inference problems repeatedly?
Posterior inference for different users, images, documents, etc.
Sampling as the inner loop of other algorithms.
We should not solve each problem from scratch.
Amortized SVGD: train feedforward neural networks to learn to draw samples by mimicking the SVGD dynamics.
Learning to Sample
Problem formulation:
Given $p$ and a neural net $f(\eta, \xi)$ with parameter $\eta$ and random input $\xi$, find $\eta$ such that the random output $x = f(\eta, \xi)$ approximates distribution $p$.
Critically challenging to solve when the structure of $f$ and the input $\xi$ are complex, or even unknown (black-box).
Progress has been made only very recently:
Amortized SVGD: sidesteps the difficulty using the Stein variational gradient.
Other recent works: [Ranganath et al. 16; Mescheder et al. 17; Li et al. 17].
Application: Variational Autoencoder [Feng+ 17, Fu+ 17]
Given observed data $\{x_{obs,i}\}$, learn a latent variable model:
$$p_\theta(x) = \int p_\theta(x, z)\,dz.$$
$x$: observed variable; $z$: latent (missing) variable; $\theta$: model parameter.
Maximum likelihood estimation of $\theta$ by EM.
Difficulty: need to sample from the posterior distribution $p_\theta(z \mid x_{obs,i})$ at each iteration, for each $x_{obs,i}$.
Amortized inference: construct an “encoder” $z = G_\eta(\xi, x)$ such that $z \sim p_\theta(z \mid x)$ [Kingma, Welling 13].
Application: Learning Unnormalized Distributions and GANs
Given observed data $\{x_{obs,i}\}_{i=1}^n$, we want to learn an energy-based model:
$$p_\theta(x) = \frac{1}{Z_\theta}\exp(\psi_\theta(x)),$$
$\psi_\theta(x)$: a neural net. $Z_\theta$: normalization constant.
Classical method: estimate $\theta$ by maximum likelihood.
Difficulty: $\log Z_\theta$ is intractable; approximating the gradient requires sampling from $p_\theta$ at every iteration.
Amortized inference: amortizing the generation of the negative samples yields GAN-style algorithms [Kim & Bengio 16; Liu+ 16; Zhai+ 16].
Application: Meta-Learning for Speeding up Bayesian Inference
Bayesian inference: given data $D$ and an unknown random parameter $z$, sample the posterior $p(z \mid D)$.
Traditional MCMC can be viewed as a hand-crafted simulator $G_\eta$ with hyper-parameter $\eta$.
Amortized inference can be used to optimize the hyper-parameters of MCMC, adaptively improving performance when processing many similar datasets.
Application: Reinforcement Learning with Deep Energy-Based Policies [Haarnoja+ 17]
Maximum entropy policy: $p_\theta(a \mid s) \propto \exp\big(\frac{1}{\alpha}\,Q(s, a)\big)$.
Implementing the policy requires drawing samples from $p_\theta(a \mid s)$ repeatedly, at each iteration.
Amortized inference: construct a generator $G_\eta(\xi)$ (an implementable policy) to sample from $p_\theta(a \mid s)$.
Amortized SVGD
Amortized SVGD: iteratively adjust $\eta$ so that the generator's outputs move along the Stein variational gradient direction; a sketch follows below.
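Here is a minimal sketch of that idea, with a toy linear generator $f(\eta, \xi) = \eta_0 + \eta_1\xi$ standing in for a neural network and a Gaussian target; the chain-rule update and all names illustrate the mechanism under these assumptions, not the authors' implementation:

```python
import numpy as np

# Amortized SVGD on a toy generator f(η, ξ) = η0 + η1·ξ, ξ ~ N(0,1):
# outputs are nudged along the SVGD direction φ*, and η follows via the
# chain rule (∂x/∂η0 = 1, ∂x/∂η1 = ξ).

def svgd_direction(x, score, h=1.0):
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * h))
    return (score(x)[:, None] * k - d / h * k).mean(axis=0)

def amortized_step(eta, score, rng, n=200, lr=1e-2):
    xi = rng.standard_normal(n)
    x = eta[0] + eta[1] * xi                  # x = f(η, ξ)
    phi = svgd_direction(x, score)            # target displacement per sample
    return eta + lr * np.array([phi.mean(), (phi * xi).mean()])

score = lambda x: -(x - 2.0)                  # target p = N(2, 1), Z-free
rng = np.random.default_rng(0)
eta = np.array([0.0, 0.1])
for _ in range(3000):
    eta = amortized_step(eta, score, rng)
print(eta)                                    # ≈ [2, 1]: f(η, ξ) ~ N(2, 1)
```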
Amortized SVGD for learning energy-based models: given observed data $\{x_{obs,i}\}_{i=1}^n$, we want to learn a model $p_\theta(x)$:
$$p_\theta(x) = \frac{1}{Z}\exp(\psi_\theta(x)), \qquad Z = \int \exp(\psi_\theta(x))\,dx.$$
This covers deep energy models (when $\psi_\theta(x)$ is a neural net), graphical models, etc.
Classical method: estimate $\theta$ by maximizing the likelihood:
$$\max_\theta\; L(\theta) \equiv \hat{\mathbb{E}}_{obs}[\log p_\theta(x)].$$
Gradient:
$$\nabla_\theta L(\theta) = \underbrace{\hat{\mathbb{E}}_{obs}[\partial_\theta \psi_\theta(x)]}_{\text{average over observed data}} - \underbrace{\mathbb{E}_{p_\theta}[\partial_\theta \psi_\theta(x)]}_{\text{expectation under the model } p_\theta}.$$
Difficulty: requires sampling from $p_\theta(x)$ at every iteration. A sketch of this contrastive gradient follows below.
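To make the two-term gradient concrete, here is a sketch on a toy exponential-family model $p_\theta(x) \propto \exp(\theta^\top T(x))$ where exact sampling happens to be possible, so an exact sampler stands in for the amortized neural sampler; all names and the model are illustrative assumptions:

```python
import numpy as np

# Contrastive MLE gradient ∇L(θ) = Ê_obs[∂θψθ] − E_{pθ}[∂θψθ] on a toy
# model ψθ(x) = θ·T(x), T(x) = (x, -x²/2): a Gaussian in natural params.

def T(x):
    return np.stack([x, -x**2 / 2], axis=-1)       # ∂θ ψθ(x) = T(x)

def sample_model(theta, n, rng):
    mu, var = theta[0] / theta[1], 1.0 / theta[1]  # natural → mean params
    return rng.normal(mu, np.sqrt(var), size=n)    # negatives ~ p_θ, exactly

rng = np.random.default_rng(0)
x_obs = rng.normal(3.0, 1.0, size=2000)            # observed data
theta = np.array([0.0, 1.0])                       # init: model = N(0, 1)
for _ in range(2000):
    x_neg = sample_model(theta, 500, rng)          # "negative" samples
    grad = T(x_obs).mean(axis=0) - T(x_neg).mean(axis=0)
    theta += 0.05 * grad                           # gradient ascent on L(θ)
print(theta[0] / theta[1], 1.0 / theta[1])         # ≈ data mean 3, variance 1
```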
[Figure: the gradient above repeated, with the intractable model expectation $\mathbb{E}_{p_\theta}[\partial_\theta \psi_\theta(x)]$ approximated by negative samples $x = f(\eta, \xi)$ produced by a neural sampler from a random seed $\xi$.]
Amortized MLE as an Adversarial Game
This can be treated as an adversarial process between the energy model and the neural sampler, similar to generative adversarial networks (GANs) [Goodfellow et al., 2014].
[Figure: real images vs. images generated by the Stein neural sampler.]
The sampler captures the semantics of the data distribution: changing the random input $\xi$ smoothly changes the generated output smoothly.
Thank You
Powered by SVGD
References
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
More Related Content

What's hot

Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Valentin De Bortoli
 
Can we estimate a constant?
Can we estimate a constant?Can we estimate a constant?
Can we estimate a constant?Christian Robert
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
 
ABC with Wasserstein distances
ABC with Wasserstein distancesABC with Wasserstein distances
ABC with Wasserstein distancesChristian Robert
 
Coordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like samplerCoordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like samplerChristian Robert
 
Bayesian hybrid variable selection under generalized linear models
Bayesian hybrid variable selection under generalized linear modelsBayesian hybrid variable selection under generalized linear models
Bayesian hybrid variable selection under generalized linear modelsCaleb (Shiqiang) Jin
 
Approximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-LikelihoodsApproximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-LikelihoodsStefano Cabras
 
Probabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance ConstraintsProbabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance ConstraintsLeo Asselborn
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the WeightsMark Chang
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017Fred J. Hickernell
 
comments on exponential ergodicity of the bouncy particle sampler
comments on exponential ergodicity of the bouncy particle samplercomments on exponential ergodicity of the bouncy particle sampler
comments on exponential ergodicity of the bouncy particle samplerChristian Robert
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big DataChristian Robert
 

What's hot (20)

Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
Can we estimate a constant?
Can we estimate a constant?Can we estimate a constant?
Can we estimate a constant?
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithms
 
ABC with Wasserstein distances
ABC with Wasserstein distancesABC with Wasserstein distances
ABC with Wasserstein distances
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Coordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like samplerCoordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like sampler
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Bayesian hybrid variable selection under generalized linear models
Bayesian hybrid variable selection under generalized linear modelsBayesian hybrid variable selection under generalized linear models
Bayesian hybrid variable selection under generalized linear models
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Approximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-LikelihoodsApproximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-Likelihoods
 
Probabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance ConstraintsProbabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance Constraints
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the Weights
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017
 
comments on exponential ergodicity of the bouncy particle sampler
comments on exponential ergodicity of the bouncy particle samplercomments on exponential ergodicity of the bouncy particle sampler
comments on exponential ergodicity of the bouncy particle sampler
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
 

Similar to QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, A Stein Variational Approach for Deep Probabilistic Modeling - Qiang Liu, Dec 12, 2017

IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionCharles Deledalle
 
Divergence clustering
Divergence clusteringDivergence clustering
Divergence clusteringFrank Nielsen
 
"That scripting language called Prolog"
"That scripting language called Prolog""That scripting language called Prolog"
"That scripting language called Prolog"Sergei Winitzki
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingFrank Nielsen
 
Divergence center-based clustering and their applications
Divergence center-based clustering and their applicationsDivergence center-based clustering and their applications
Divergence center-based clustering and their applicationsFrank Nielsen
 
Locality-sensitive hashing for search in metric space
Locality-sensitive hashing for search in metric space Locality-sensitive hashing for search in metric space
Locality-sensitive hashing for search in metric space Eliezer Silva
 
Mathematics notes and formula for class 12 chapter 7. integrals
Mathematics notes and formula for class 12 chapter 7. integrals Mathematics notes and formula for class 12 chapter 7. integrals
Mathematics notes and formula for class 12 chapter 7. integrals sakhi pathak
 
Wide-Coverage CCG Parsing with Quantifier Scope
Wide-Coverage CCG Parsing with Quantifier ScopeWide-Coverage CCG Parsing with Quantifier Scope
Wide-Coverage CCG Parsing with Quantifier Scopedimkart
 
Lesson 27: Integration by Substitution (Section 041 slides)
Lesson 27: Integration by Substitution (Section 041 slides)Lesson 27: Integration by Substitution (Section 041 slides)
Lesson 27: Integration by Substitution (Section 041 slides)Mel Anthony Pepito
 
Lesson 27: Integration by Substitution (Section 041 slides)
Lesson 27: Integration by Substitution (Section 041 slides)Lesson 27: Integration by Substitution (Section 041 slides)
Lesson 27: Integration by Substitution (Section 041 slides)Matthew Leingang
 
High-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHigh-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHolistic Benchmarking of Big Linked Data
 
Parallel Bayesian Optimization
Parallel Bayesian OptimizationParallel Bayesian Optimization
Parallel Bayesian OptimizationSri Ambati
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Christian Robert
 
A common fixed point theorem in cone metric spaces
A common fixed point theorem in cone metric spacesA common fixed point theorem in cone metric spaces
A common fixed point theorem in cone metric spacesAlexander Decker
 

Similar to QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, A Stein Variational Approach for Deep Probabilistic Modeling - Qiang Liu, Dec 12, 2017 (20)

IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 
SASA 2016
SASA 2016SASA 2016
SASA 2016
 
Deep Learning Opening Workshop - ProxSARAH Algorithms for Stochastic Composit...
Deep Learning Opening Workshop - ProxSARAH Algorithms for Stochastic Composit...Deep Learning Opening Workshop - ProxSARAH Algorithms for Stochastic Composit...
Deep Learning Opening Workshop - ProxSARAH Algorithms for Stochastic Composit...
 
Divergence clustering
Divergence clusteringDivergence clustering
Divergence clustering
 
"That scripting language called Prolog"
"That scripting language called Prolog""That scripting language called Prolog"
"That scripting language called Prolog"
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
 
Divergence center-based clustering and their applications
Divergence center-based clustering and their applicationsDivergence center-based clustering and their applications
Divergence center-based clustering and their applications
 
Locality-sensitive hashing for search in metric space
Locality-sensitive hashing for search in metric space Locality-sensitive hashing for search in metric space
Locality-sensitive hashing for search in metric space
 
Mathematics notes and formula for class 12 chapter 7. integrals
Mathematics notes and formula for class 12 chapter 7. integrals Mathematics notes and formula for class 12 chapter 7. integrals
Mathematics notes and formula for class 12 chapter 7. integrals
 
Wide-Coverage CCG Parsing with Quantifier Scope
Wide-Coverage CCG Parsing with Quantifier ScopeWide-Coverage CCG Parsing with Quantifier Scope
Wide-Coverage CCG Parsing with Quantifier Scope
 
Lesson 27: Integration by Substitution (Section 041 slides)
Lesson 27: Integration by Substitution (Section 041 slides)Lesson 27: Integration by Substitution (Section 041 slides)
Lesson 27: Integration by Substitution (Section 041 slides)
 
Lesson 27: Integration by Substitution (Section 041 slides)
Lesson 27: Integration by Substitution (Section 041 slides)Lesson 27: Integration by Substitution (Section 041 slides)
Lesson 27: Integration by Substitution (Section 041 slides)
 
High-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHigh-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K Characters
 
Proba stats-r1-2017
Proba stats-r1-2017Proba stats-r1-2017
Proba stats-r1-2017
 
Parallel Bayesian Optimization
Parallel Bayesian OptimizationParallel Bayesian Optimization
Parallel Bayesian Optimization
 
E42012426
E42012426E42012426
E42012426
 
Puy chosuantai2
Puy chosuantai2Puy chosuantai2
Puy chosuantai2
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
Lecture2 xing
Lecture2 xingLecture2 xing
Lecture2 xing
 
A common fixed point theorem in cone metric spaces
A common fixed point theorem in cone metric spacesA common fixed point theorem in cone metric spaces
A common fixed point theorem in cone metric spaces
 

More from The Statistical and Applied Mathematical Sciences Institute

More from The Statistical and Applied Mathematical Sciences Institute (20)

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 

Recently uploaded

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research DiscourseAnita GoswamiGiri
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdfMr Bounab Samir
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 

Recently uploaded (20)

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
prashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Professionprashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Profession
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdf
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 

QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, A Stein Variational Approach for Deep Probabilistic Modeling - Qiang Liu, Dec 12, 2017

  • 1. A Stein Variational Framework for Deep Probabilistic Modeling Qiang Liu Dartmouth College (→ UT Austin) Liu et al. (Dartmouth) December 12, 2017 1 / 47
  • 2. Probabilistic Modeling for Machine Learning Modern machine learning = Complex data + Complex models Complex data {xi } Complex models p(x) Liu et al. (Dartmouth) December 12, 2017 2 / 47
  • 3. Unnormalized Distributions In practice, many distributions have unnormalized densities: p(x) = 1 Z ¯p(x), Z = ¯p(x)dx. Z: normalization constant, critically difficult to calculate! Widely appear in Bayesian inference, Probabilistic graphical models, Deep energy-based models, Log-linear models, and many more ... Highly difficult to learn, sample and evaluate. Liu et al. (Dartmouth) December 12, 2017 3 / 47
  • 4. Scalable computational algorithms are the key. Can benefit from integrating tools in different areas ... Liu et al. (Dartmouth) December 12, 2017 4 / 47
  • 5. This Talk This talk focuses on the inference (sampling) problem: Given p, find {xi } to approximation p. Two applications: Policy optimization in reinforcement learning. Training neural networks to generate natural images. Liu et al. (Dartmouth) December 12, 2017 5 / 47
  • 6. Classical Methods for Inference (Sampling) Sampling: Given p, find {xi } to approximation p. Monte Carlo / Markov chain Monte Carlo (MCMC): Simulate random points. Asymptotically “correct”, but slow. Variational inference: Approximate p with a simpler qθ (e.g., Gaussian): minθ∈Θ KL(qθ || p). Need parametric assumption: fast, but “wrong”. Optimization (maximum a posteriori (MAP)): Find a single point approximation: x∗ = arg max p(x). Faster, local optima, no uncertainty assessment. Liu et al. (Dartmouth) December 12, 2017 6 / 47
  • 7. Stein Variational Gradient Descent (SVGD) [Liu Wang, 2016] Directly minimize the Kullback-Leibler (KL) divergence between {xi } and p: min {xi } KL(q || p) where q is empirical distribution q(x) = n i=1 δ(x − xi )/n. An ill-posed problem? KL(q || p) = Ex∼q[log(q/p)] = ∞. Turns out to be doable, with some new insights... KL divergence is infinite, but its “gradient” is not... Liu et al. (Dartmouth) December 12, 2017 7 / 47
  • 8. Stein Variational Gradient Descent (SVGD) [Liu Wang, 2016] Idea: Iteratively move {xi }n i=1 towards the target p by updates of form xi ← xi + φ(xi ), : step-size. φ: a perturbation direction chosen to maximally decrease the KL di- vergence with p: φ = arg max φ∈F KL(q || p) old particles − KL(q[ φ] || p) updated particles where q[ φ] is the density of x = x + φ(x) when the density of x is q. Think q as the empirical distribution of {xi }: q(x) = n i=1 δ(x − xi )/n. Liu et al. (Dartmouth) December 12, 2017 8 / 47
  • 9. Stein Variational Gradient Descent (SVGD) [Liu Wang, 2016] Idea: Iteratively move {xi }n i=1 towards the target p by updates of form xi ← xi + φ(xi ), : step-size. φ: a perturbation direction chosen to maximally decrease the KL di- vergence with p: φ = arg max φ∈F KL(q || p) − KL(q[ φ] || p) ≈ arg max φ∈F − ∂ ∂ KL(q[ φ] || p) =0 , //when step size is small where q[ φ] is the density of x = x + φ(x) when the density of x is q. Think q as the empirical distribution of {xi }: q(x) = n i=1 δ(x − xi )/n. Liu et al. (Dartmouth) December 12, 2017 8 / 47
  • 10. Stein Variational Gradient Descent (SVGD) [Liu Wang, 2016] Key: the objective is a simple, linear functional of φ: − ∂ ∂ KL(q[ φ] || p) =0 = Ex∼q[Tpφ(x)]. where Tp is a linear operator called Stein operator related to p: Tpφ(x) def = x log p(x), φ(x) + x , φ(x) . Liu et al. (Dartmouth) December 12, 2017 9 / 47
  • 11. Stein Variational Gradient Descent (SVGD) [Liu Wang, 2016] Key: the objective is a simple, linear functional of φ: − ∂ ∂ KL(q[ φ] || p) =0 = Ex∼q[Tpφ(x)]. where Tp is a linear operator called Stein operator related to p: Tpφ(x) def = x log p(x) score function , φ(x) + x , φ(x) . Score function x log p(x) = x p(x) p(x) , independent of the normalization constant Z! Liu et al. (Dartmouth) December 12, 2017 9 / 47
  • 12. Stein Variational Gradient Descent (SVGD) [Liu Wang, 2016] Key: the objective is a simple, linear functional of φ: − ∂ ∂ KL(q[ φ] || p) =0 = Ex∼q[Tpφ(x)]. where Tp is a linear operator called Stein operator related to p: Tpφ(x) def = x log p(x), φ(x) + x , φ(x) . Stein’s identity (if p=q): Ex∼p[Tpφ(x)] = Ex∼p[ x log p(x) + x · φ] = (φ(x) p(x) + p(x) x · φ(x))dx = 0 (integration by parts). Liu et al. (Dartmouth) December 12, 2017 9 / 47
  • 13. Stein Variational Gradient Descent (SVGD) [Liu Wang, 2016] Key: the objective is a simple, linear functional of φ: − ∂ ∂ KL(q[ φ] || p) =0 = Ex∼q[Tpφ(x)]. where Tp is a linear operator called Stein operator related to p: Tpφ(x) def = x log p(x), φ(x) + x , φ(x) . Stein’s method: a set of theoretical techniques for proving fundamental approximation bounds and limits (such as central limit theorem) in probability theory. A large body of theoretical work. Known to be “remarkably powerful”. Liu et al. (Dartmouth) December 12, 2017 9 / 47
  • 14. Stein Discrepancy The optimization is equivalent to D(q || p) def = max φ∈F Eq[Tpφ] where D(q || p) is called Stein discrepancy: D(q || p) = 0 iff q = p if F is “large” enough. Liu et al. (Dartmouth) December 12, 2017 10 / 47
  • 15. Stein Discrepancy The optimization is equivalent to D(q || p) def = max φ∈F Eq[Tpφ] where D(q || p) is called Stein discrepancy: D(q || p) = 0 iff q = p if F is “large” enough. The choice of F is critical. Traditional Stein discrepancy is not computable: casts challenging infinite dimensional functional optimizations. Imposing constraints only on finite numbers of points [Gorham, Mackey 15; Gorham et al. 16] Obtaining closed form solution using reproducing kernel Hilbert space [Liu et al. 16; Chwialkowski et al. 16; Oates et al. 14; Gorham, Mackey 17] Liu et al. (Dartmouth) December 12, 2017 10 / 47
  • 16. Stein Discrepancy The optimization is equivalent to D(q || p) def = max φ∈F Eq[Tpφ] where D(q || p) is called Stein discrepancy: D(q || p) = 0 iff q = p if F is “large” enough. The choice of F is critical. Traditional Stein discrepancy is not computable: casts challenging infinite dimensional functional optimizations. Imposing constraints only on finite numbers of points [Gorham, Mackey 15; Gorham et al. 16] Obtaining closed form solution using reproducing kernel Hilbert space [Liu et al. 16; Chwialkowski et al. 16; Oates et al. 14; Gorham, Mackey 17] Liu et al. (Dartmouth) December 12, 2017 10 / 47
  • 17. Kernel Stein Discrepancy [Liu et al. 16; Chwialkowski et al. 16]
Computable Stein discrepancy using kernels: take F to be the unit ball of a reproducing kernel Hilbert space (RKHS) H with positive definite kernel k(x, x′):
    D(q ‖ p) := max_{φ∈H} E_q[T_p φ]   s.t. ‖φ‖_H ≤ 1.
Closed-form solution:
    φ*(·) ∝ E_{x∼q}[T_p k(x, ·)] = E_{x∼q}[∇x log p(x) k(x, ·) + ∇x k(x, ·)].
Kernel Stein discrepancy:
    D(q, p)² = E_{x,x′∼q}[T_p^x T_p^{x′} k(x, x′)],
where T_p^x, T_p^{x′} denote the Stein operator applied w.r.t. the variable x, x′ respectively.
Liu et al. (Dartmouth) December 12, 2017 11 / 47
  • 19. Kernel Stein Discrepancy
Kernel Stein discrepancy provides a computational tool for comparing samples {xi} (from an unknown q) with an unnormalized model p:
    D({xi}, p)² := (1/n²) Σᵢⱼ T_p^x T_p^{x′} k(xi, xj).
Applications:
Goodness-of-fit tests for unnormalized distributions [Liu et al. 16; Chwialkowski et al. 16].
Black-box importance sampling [Liu, Lee 16]: importance weights for samples from unknown distributions by minimizing Stein discrepancy, with super-efficient convergence rates.
Liu et al. (Dartmouth) December 12, 2017 12 / 47
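A small NumPy sketch (mine, not from the slides) of the statistic above with an RBF kernel k(x, x′) = exp(−‖x − x′‖²/(2h²)); the closed form for T_p^x T_p^{x′} k follows by differentiating the kernel, and `score` stands for a user-supplied function returning ∇x log p̄(x), so the normalization constant never enters.

```python
import numpy as np

def ksd_squared(x, score, h=1.0):
    """D({x_i}, p)^2 with RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 h^2)).
    x: (n, d) sample array; score(x): (n, d) array of grad_x log p(x)."""
    n, d = x.shape
    s = score(x)
    diff = x[:, None, :] - x[None, :, :]           # (n, n, d): x_i - x_j
    sqdist = np.sum(diff**2, axis=-1)
    k = np.exp(-sqdist / (2 * h**2))

    # T_p^x T_p^{x'} k(x, x') expanded into its four terms:
    t1 = (s @ s.T) * k                                 # s_i . s_j k
    t2 = np.einsum('id,ijd->ij', s, diff) / h**2 * k   # s_i . grad_{x_j} k
    t3 = -np.einsum('jd,ijd->ij', s, diff) / h**2 * k  # s_j . grad_{x_i} k
    t4 = (d / h**2 - sqdist / h**4) * k                # trace term
    return np.mean(t1 + t2 + t3 + t4)

# Example: samples from N(0, I) scored against p = N(0, I) give a small value.
rng = np.random.default_rng(0)
print(ksd_squared(rng.standard_normal((500, 2)), lambda x: -x))
```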
  • 20. Stein Variational Gradient Descent
SVGD: approximate E_{x∼q}[·] with the empirical average Ê_{x∼{xj}ⁿⱼ₌₁}[·] over the current points:
    xi ← xi + ε Ê_{x∼{xj}ⁿⱼ₌₁}[∇x log p(x) k(x, xi) + ∇x k(x, xi)],   ∀i = 1, ..., n.
Iteratively move the particles {xi} to fit p.
Liu et al. (Dartmouth) December 12, 2017 13 / 47
  • 21. Stein Variational Gradient Descent
SVGD: iteratively update {xi} until convergence:
    xi ← xi + ε Ê_{x∼{xj}ⁿⱼ₌₁}[ ∇x log p(x) k(x, xi)   (weighted sum of gradients)
                               + ∇x k(x, xi) ],          (repulsive force)   ∀i = 1, ..., n.
Two terms:
∇x log p(x): moves the particles {xi} towards high-probability regions of p(x); nearby particles share gradients through the kernel-weighted sum.
∇x k(x, x′): enforces diversity in {xi} (otherwise all xi would collapse onto the modes of p(x)).
A minimal NumPy sketch of this update follows below.
Liu et al. (Dartmouth) December 12, 2017 14 / 47
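The sketch below is mine; it assumes an RBF kernel in the k(x, x′) = exp(−‖x − x′‖²/(2h²)) parameterization with the median-heuristic bandwidth, and only needs ∇x log p̄, so p may be unnormalized.

```python
import numpy as np

def svgd_update(x, score, stepsize=0.1):
    """One SVGD iteration on particles x (n, d); score(x) = grad_x log p(x)."""
    n, d = x.shape
    diff = x[:, None, :] - x[None, :, :]               # (n, n, d): x_i - x_j
    sqdist = np.sum(diff**2, axis=-1)
    h2 = 0.5 * np.median(sqdist) / np.log(n + 1)       # median heuristic
    k = np.exp(-sqdist / (2 * h2))                     # kernel matrix

    grad_term = k @ score(x)                           # kernel-weighted gradients
    repulsion = np.einsum('ij,ijd->id', k, diff) / h2  # sum_j grad_{x_j} k(x_j, x_i)
    return x + stepsize * (grad_term + repulsion) / n

# Example: move particles from a poor initialization toward p = N(0, I).
rng = np.random.default_rng(1)
particles = rng.normal(loc=5.0, size=(100, 2))
for _ in range(500):
    particles = svgd_update(particles, lambda x: -x)
print(particles.mean(axis=0), particles.std(axis=0))   # roughly [0 0] and [1 1]
```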
  • 23. Stein Variational Gradient Descent Distribution p defined as a 2D Gaussian mixture on the pixels. Liu et al. (Dartmouth) December 12, 2017 15 / 47
  • 24. SVGD vs. MAP and Monte Carlo
    xi ← xi + ε Ê_{x∼{xj}ⁿⱼ₌₁}[∇x log p(x) k(x, xi) (gradient) + ∇x k(x, xi) (repulsive force)],   ∀i = 1, ..., n.
When using a single particle (n = 1), SVGD reduces to standard gradient ascent for max_x log p(x) (i.e., maximum a posteriori (MAP)):
    x ← x + ε ∇x log p(x).
MAP (SVGD with n = 1) already performs well in many practical cases.
Typical Monte Carlo / MCMC methods perform worse when n = 1.
Liu et al. (Dartmouth) December 12, 2017 16 / 47
  • 25. SVGD as Gradient Flow of KL Divergence [Liu 2016, arXiv:1704.07520]
The empirical measures of the particles weakly converge to the solution of a nonlinear Fokker–Planck equation, which is a gradient flow of the KL divergence:
    ∂qt/∂t = −∇ · (φ*_{qt,p} qt) = −grad KL(qt ‖ p),
which decreases the KL divergence monotonically:
    (d/dt) KL(qt ‖ p) = −D(qt, p)².
Liu et al. (Dartmouth) December 12, 2017 17 / 47
  • 26. SVGD as Gradient Flow of KL Divergence [Liu 2016, arXiv:1704.07520]
The empirical measures of the particles weakly converge to the solution of a nonlinear Fokker–Planck equation, which is a gradient flow of the KL divergence:
    ∂qt(x)/∂t = −grad KL(qt ‖ p),
where grad KL(q ‖ p) is a functional gradient defined w.r.t. a new notion of distance between distributions: the minimum cost of transporting the mass of q to p. This induces a new geometric structure on the space of distributions.
Liu et al. (Dartmouth) December 12, 2017 18 / 47
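Putting the pieces together, a one-line derivation (sketched; it combines the linearization of the KL divergence from slide 10 with the unnormalized Stein variational direction φ*_{q,p}(·) = E_{x′∼q}[T_p^{x′} k(x′, ·)] from slide 17) shows why the flow decreases the KL divergence at exactly the squared kernel-Stein-discrepancy rate:

```latex
\frac{d}{dt}\,\mathrm{KL}(q_t \,\|\, p)
  = \frac{\partial}{\partial \epsilon}\,
    \mathrm{KL}\big(q_{[\epsilon \phi^*_{q_t,p}]} \,\|\, p\big)\Big|_{\epsilon=0}
  = -\,\mathbb{E}_{x\sim q_t}\big[\mathcal{T}_p\,\phi^*_{q_t,p}(x)\big]
  = -\,\mathbb{E}_{x,x'\sim q_t}\big[\mathcal{T}_p^{x}\mathcal{T}_p^{x'}k(x,x')\big]
  = -\,\mathbb{D}(q_t, p)^2 .
```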
  • 27. Bayesian Logistic Regression
[Figure: testing accuracy of SVGD (our method) vs. parallel and sequential stochastic Langevin (SGLD), particle mirror descent (PMD), and doubly stochastic variational inference (DSVI). Panel (a): accuracy vs. number of epochs with particle size n = 100; panel (b): accuracy vs. particle size n at the 3000th iteration.]
Liu et al. (Dartmouth) December 12, 2017 19 / 47
  • 28. Bayesian Neural Network
Test Bayesian neural nets on benchmark datasets, using 20 particles. Compared with probabilistic back propagation (PBP) [Hernandez-Lobato et al. 2015].

Dataset   | Avg. Test RMSE (PBP) | Avg. Test RMSE (Ours) | Avg. Test LL (PBP) | Avg. Test LL (Ours) | Time, s (PBP) | Time, s (Ours)
Boston    | 2.977 ± 0.093        | 2.957 ± 0.099         | −2.579 ± 0.052     | −2.504 ± 0.029      | 18            | 16
Concrete  | 5.506 ± 0.103        | 5.324 ± 0.104         | −3.137 ± 0.021     | −3.082 ± 0.018      | 33            | 24
Energy    | 1.734 ± 0.051        | 1.374 ± 0.045         | −1.981 ± 0.028     | −1.767 ± 0.024      | 25            | 21
Kin8nm    | 0.098 ± 0.001        | 0.090 ± 0.001         |  0.901 ± 0.010     |  0.984 ± 0.008      | 118           | 41
Naval     | 0.006 ± 0.000        | 0.004 ± 0.000         |  3.735 ± 0.004     |  4.089 ± 0.012      | 173           | 49
Combined  | 4.052 ± 0.031        | 4.033 ± 0.033         | −2.819 ± 0.008     | −2.815 ± 0.008      | 136           | 51
Protein   | 4.623 ± 0.009        | 4.606 ± 0.013         | −2.950 ± 0.002     | −2.947 ± 0.003      | 682           | 68
Wine      | 0.614 ± 0.008        | 0.609 ± 0.010         | −0.931 ± 0.014     | −0.925 ± 0.014      | 26            | 22
Yacht     | 0.778 ± 0.042        | 0.864 ± 0.052         | −1.211 ± 0.044     | −1.225 ± 0.042      | 25            | 25
Year      | 8.733 ± NA           | 8.684 ± NA            | −3.586 ± NA        | −3.580 ± NA         | 7777          | 684
Liu et al. (Dartmouth) December 12, 2017 20 / 47
  • 29. SVGD as a Search Heuristic
Particles collaborate to explore a large search space. Can be used to solve challenging non-convex optimization problems.
Application: policy optimization in deep reinforcement learning.
Liu et al. (Dartmouth) December 12, 2017 21 / 47
  • 30. A Very Quick Intro to Reinforcement Learning
Agents take actions a based on observed states s, and receive rewards r. Policy πθ(a | s), parameterized by θ.
Goal: find the optimal policy πθ(a | s) to maximize the expected reward:
    max_θ J(θ) = E[r(s, a) | πθ].
Can be viewed as a black-box optimization.
Liu et al. (Dartmouth) December 12, 2017 22 / 47
  • 31. Model-Free Policy Gradient
Model-free policy gradient methods estimate the gradient (without knowing the transition and reward models) and perform gradient ascent:
    θ ← θ + ε ∇θ J(θ).
Different methods for gradient estimation:
Finite difference methods.
Likelihood ratio methods: REINFORCE, etc.
Actor–critic methods: Advantage Actor-Critic (A2C), etc.
Liu et al. (Dartmouth) December 12, 2017 23 / 47
  • 32. Model-Free Policy Gradient
Advantages: better convergence properties; works for high-dimensional, continuous control tasks. Impressive results on Atari games, vision-based navigation, etc.
Challenges: convergence to local optima; high variance in gradient estimation.
Liu et al. (Dartmouth) December 12, 2017 24 / 47
  • 33. Stein Variational Policy Gradient [Liu et al. 17, arXiv:1704.02399]
Stein variational policy gradient: find a group of policies {θi} by
    θi ← θi + (ε/n) Σⁿⱼ₌₁ [ ∇θj J(θj) k(θj, θi)   (gradient sharing)
                            + α ∇θj k(θj, θi) ].    (repulsive force)
Similar to collective behaviors in swarm intelligence.
Liu et al. (Dartmouth) December 12, 2017 25 / 47
  • 34. Stein Variational Policy Gradient [Liu et al. 17, arXiv:1704.02399]
Stein variational policy gradient: find a group of policies {θi} by
    θi ← θi + (ε/n) Σⁿⱼ₌₁ [ ∇θj J(θj) k(θj, θi)   (gradient sharing)
                            + α ∇θj k(θj, θi) ].    (repulsive force)
Can be viewed as sampling {θi} from a Boltzmann distribution:
    p(θ) ∝ exp(J(θ)/α),
α: temperature parameter.
Liu et al. (Dartmouth) December 12, 2017 25 / 47
  • 35. Stein Variational Policy Gradient [Liu et al. 17, arXiv:1704.02399]
Stein variational policy gradient: find a group of policies {θi} by
    θi ← θi + (ε/n) Σⁿⱼ₌₁ [ ∇θj J(θj) k(θj, θi)   (gradient sharing)
                            + α ∇θj k(θj, θi) ].    (repulsive force)
Can be viewed as sampling {θi} from a Boltzmann distribution:
    p(θ) ∝ exp(J(θ)/α) = arg max_q { E_q[J(θ)] + α H(q) },
where the entropy regularization α H(q) encourages exploration. α: temperature parameter. H(q): entropy.
A minimal sketch of this update appears below.
Liu et al. (Dartmouth) December 12, 2017 25 / 47
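A minimal sketch of the SVPG update (mine; it presumes the per-agent policy-gradient estimates ∇θj J(θj) have already been computed by REINFORCE, A2C, or any other estimator, and uses an RBF kernel with median-heuristic bandwidth):

```python
import numpy as np

def svpg_update(thetas, grad_J, stepsize=1e-3, alpha=1.0):
    """One SVPG step. thetas: (n, d) policy parameters, one row per agent;
    grad_J: (n, d) policy-gradient estimates for each agent (assumed given)."""
    n, d = thetas.shape
    diff = thetas[:, None, :] - thetas[None, :, :]
    sqdist = np.sum(diff**2, axis=-1)
    h2 = 0.5 * np.median(sqdist) / np.log(n + 1)        # median heuristic
    k = np.exp(-sqdist / (2 * h2))

    sharing = k @ grad_J                                # agents share gradients
    repulsion = np.einsum('ij,ijd->id', k, diff) / h2   # keeps policies diverse
    return thetas + stepsize / n * (sharing + alpha * repulsion)
```

With α → 0 the repulsion vanishes and the agents decouple into n independent gradient learners; larger α trades exploitation for exploration, matching the temperature in the Boltzmann view above.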
  • 36. REINFORCE-SVPG: Stein variational policy gradient (n = 16 agents). REINFORCE-Independent: n independent gradient descent agents. REINFORCE-Joint: a single agent, using n times as much data per iteration. Liu et al. (Dartmouth) December 12, 2017 26 / 47
  • 37. A2C-SVPG: Stein variational policy gradient (n = 16 agents). A2C-Independent: n independent gradient descent agents. A2C-Joint: a single agent, using n times as much data per iteration. Liu et al. (Dartmouth) December 12, 2017 27 / 47
  • 38. Average returns of the policies given by SVGD (blue) and independent A2C (red), for Cartpole Swing Up. Liu et al. (Dartmouth) December 12, 2017 28 / 47
  • 39. State visitation density of the top 4 policies given by SVGD (upper) and independent REINFORCE (lower), for Cartpole Swing Up. Liu et al. (Dartmouth) December 12, 2017 29 / 47
  • 40. Swimmer Liu et al. (Dartmouth) December 12, 2017 30 / 47
  • 41. Top Four Policies by SVPG Liu et al. (Dartmouth) December 12, 2017 31 / 47
  • 42. Stein Variational Gradient Descent SVGD: a simple, efficient algorithm for sampling and non-convex optimization. Liu et al. (Dartmouth) December 12, 2017 32 / 47
  • 43. Amortized SVGD: Learning to Sample
SVGD is designed for sampling from individual distributions. What if we need to solve many similar inference problems repeatedly?
Posterior inference for different users, images, documents, etc.
Sampling as the inner loop of other algorithms.
We should not solve each problem from scratch.
Amortized SVGD: train feedforward neural networks to learn to draw samples by mimicking the SVGD dynamics.
Liu et al. (Dartmouth) December 12, 2017 33 / 47
  • 44. Learning to Sample
Problem formulation: given p and a neural net f(η, ξ) with parameter η and random input ξ, find η such that the random output x = f(η, ξ) approximates the distribution p.
Highly challenging to solve when the structure of f and the input ξ are complex, or even unknown (black-box).
Progress has been made only very recently:
Amortized SVGD: sidesteps the difficulty using the Stein variational gradient.
Other recent works: [Ranganath et al. 16; Mescheder et al. 17; Li et al. 17].
Liu et al. (Dartmouth) December 12, 2017 34 / 47
  • 45. Application: Variational Autoencoder [Feng+ 17, Fu+ 17]
Given observations {x_obs,i}, learn a latent variable model pθ(x) = ∫ pθ(x, z) dz.
x: observed variable; z: latent (missing) variable; θ: model parameter.
Maximum likelihood estimation of θ by EM.
Difficulty: need to sample from the posterior distribution pθ(z | x_obs,i) at each iteration, for each x_obs,i.
Amortized inference: construct an “encoder” z = Gη(ξ, x) such that z ∼ pθ(z | x) [Kingma, Welling 13].
Liu et al. (Dartmouth) December 12, 2017 35 / 47
  • 46. Application: Learning Unnormalized Distributions and GANs
Given observations {x_obs,i}ⁿᵢ₌₁, want to learn an energy-based model
    pθ(x) = (1/Zθ) exp(ψθ(x)),
ψθ(x): a neural net. Zθ: normalization constant.
Classical method: estimate θ by maximum likelihood.
Difficulty: log Zθ is intractable; requires sampling from pθ at every iteration to approximate the gradient.
Amortized inference: amortizing the generation of the negative samples yields GAN-style algorithms [Kim & Bengio 16; Liu+ 16; Zhai+ 16].
Liu et al. (Dartmouth) December 12, 2017 36 / 47
  • 47. Application: Meta-Learning for Speeding up Bayesian Inference
Bayesian inference: given data D and an unknown random parameter z, sample the posterior p(z | D).
Traditional MCMC can be viewed as a hand-crafted simulator Gη with hyper-parameter η.
Amortized inference can be used to optimize the hyper-parameters of MCMC, adaptively improving performance when processing many similar datasets.
Liu et al. (Dartmouth) December 12, 2017 37 / 47
  • 48. Application: Reinforcement Learning with Deep Energy-based Policies [Haarnoja+ 17]
Maximum entropy policy: pθ(a | s) ∝ exp(Q(s, a)/α).
Implementing the policy requires drawing samples from pθ(a | s) repeatedly, at each iteration.
Amortized inference: construct a generator Gη(ξ) (an implementable policy) to sample from pθ(a | s).
Liu et al. (Dartmouth) December 12, 2017 38 / 47
  • 49. Amortized SVGD Amortized SVGD: Iteratively adjust η to make the output move along the Stein variational gradient direction. Liu et al. (Dartmouth) December 12, 2017 39 / 47
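A minimal PyTorch sketch (mine; the names and the squared-error projection are illustrative, following a "move the outputs one SVGD step, then regress the network onto the shifted outputs" scheme) of one amortized-SVGD update:

```python
import torch

def amortized_svgd_step(f, opt, score, xi_dim, batch=100, eps=0.1):
    """One amortized-SVGD update of a neural sampler x = f(xi).
    f: noise-to-sample network; opt: its optimizer;
    score(x): grad_x log p(x) at the sampler outputs."""
    xi = torch.randn(batch, xi_dim)
    x = f(xi)                                    # (batch, d), graph kept
    xd = x.detach()                              # treat outputs as SVGD particles

    s = score(xd)
    diff = xd[:, None, :] - xd[None, :, :]
    sqdist = (diff**2).sum(-1)
    h2 = 0.5 * sqdist.median() / torch.log(torch.tensor(batch + 1.0))
    k = torch.exp(-sqdist / (2 * h2))
    phi = (k @ s + torch.einsum('ij,ijd->id', k, diff) / h2) / batch

    target = xd + eps * phi                      # where one SVGD step would move x
    loss = 0.5 * ((x - target)**2).sum()         # d loss / d x = -eps * phi
    opt.zero_grad(); loss.backward(); opt.step()
```

Usage: f can be any torch.nn module mapping (batch, xi_dim) noise to samples, e.g. a small MLP, with opt = torch.optim.Adam(f.parameters()); after training, fresh samples cost a single forward pass.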
  • 53. Amortized SVGD for Learning Energy-Based Models
Given observed data {x_obs,i}ⁿᵢ₌₁, want to learn a model pθ(x):
    pθ(x) = (1/Z) exp(ψθ(x)),   Z = ∫ exp(ψθ(x)) dx.
Covers deep energy models (when ψθ(x) is a neural net), graphical models, etc.
Classical method: estimate θ by maximizing the likelihood:
    max_θ L(θ) ≡ Ê_obs[log pθ(x)].
Gradient:
    ∇θ L(θ) = Ê_obs[∂θ ψθ(x)]   (average on observed data)
            − E_{pθ}[∂θ ψθ(x)].  (expectation under the model pθ)
Difficulty: requires sampling from pθ(x) at every iteration.
Liu et al. (Dartmouth) December 12, 2017 41 / 47
  • 54. Difficulty: requires sampling from pθ(x) at every iteration.
Gradient:
    ∇θ L(θ) = Ê_obs[∂θ ψθ(x)] − E_{pθ}[∂θ ψθ(x)].
[Figure: a neural sampler x = f(η, ξ), driven by a random seed ξ, generates the negative samples for the model expectation.]
A minimal sketch of the resulting training loop follows below.
Liu et al. (Dartmouth) December 12, 2017 42 / 47
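Combining the pieces, a sketch (mine; the interfaces are assumed, and it reuses amortized_svgd_step from the earlier sketch) of one iteration of the resulting amortized-MLE loop:

```python
import torch

def amortized_mle_step(psi, psi_opt, sampler, sampler_opt, x_obs, xi_dim, eps=0.1):
    """One amortized-MLE iteration for p_theta(x) ∝ exp(psi_theta(x)).
    psi: energy network; sampler: noise-to-sample network (the neural sampler)."""
    def score(x):                        # grad_x psi(x) = grad_x log p_theta(x)
        x = x.detach().requires_grad_(True)
        return torch.autograd.grad(psi(x).sum(), x)[0]

    # 1) Sampler step: chase the current model p_theta with amortized SVGD.
    amortized_svgd_step(sampler, sampler_opt, score, xi_dim,
                        batch=len(x_obs), eps=eps)

    # 2) Model step: grad L = E_obs[d_theta psi] - E_model[d_theta psi];
    #    negative samples come from the (frozen) sampler.
    x_neg = sampler(torch.randn(len(x_obs), xi_dim)).detach()
    loss = psi(x_neg).mean() - psi(x_obs).mean()    # minimizing this ascends L
    psi_opt.zero_grad(); loss.backward(); psi_opt.step()
```

The two players mirror a GAN: the energy model pushes ψθ up on data and down on generated samples, while the sampler continually adapts to match the current pθ.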
  • 55. Amortized MLE as an Adversarial Game Can be treated as an adversarial process between the energy model and the neural sampler. Similar to generative adversarial networks (GAN) [Goodfellow et al., 2014]. Liu et al. (Dartmouth) December 12, 2017 43 / 47
  • 56. [Figure: real images vs. images generated by the Stein neural sampler.] Liu et al. (Dartmouth) December 12, 2017 44 / 47
  • 57. The sampler captures the semantics of the data distribution: changing the random input ξ smoothly changes the generated image. Liu et al. (Dartmouth) December 12, 2017 45 / 47
  • 58. Thank You Powered by SVGD Liu et al. (Dartmouth) December 12, 2017 46 / 47
  • 59. References I
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
Liu et al. (Dartmouth) December 12, 2017 47 / 47