Measuring Sample Quality with Kernels
Lester Mackey∗,‡
Joint work with Jackson Gorham†,‡
Microsoft Research∗
Opendoor†
Stanford University‡
December 12, 2017
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 1 / 26
Motivation: Large-scale Posterior Inference
Question: How do we scale Markov chain Monte Carlo (MCMC)
posterior inference to massive datasets?
MCMC Benefit: Approximates intractable posterior expectations E_P[h(Z)] = ∫_X p(x)h(x)dx with asymptotically exact sample estimates E_Q[h(X)] = (1/n)∑_{i=1}^n h(x_i)
Problem: Each point xi requires iterating over entire dataset!
Template solution: Approximate MCMC with subset posteriors
[Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
Approximate standard MCMC procedure in a manner that makes
use of only a small subset of datapoints per sample
Reduced computational overhead leads to faster sampling and
reduced Monte Carlo variance
Introduces asymptotic bias: target distribution is not stationary
Hope that for fixed amount of sampling time, variance reduction
will outweigh bias introduced
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 2 / 26
Motivation: Large-scale Posterior Inference
Template solution: Approximate MCMC with subset posteriors
[Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
Hope that for fixed amount of sampling time, variance reduction
will outweigh bias introduced
Introduces new challenges
How do we compare and evaluate samples from approximate
MCMC procedures?
How do we select samplers and their tuning parameters?
How do we quantify the bias-variance trade-off explicitly?
Difficulty: Standard evaluation criteria like effective sample size,
trace plots, and variance diagnostics assume convergence to the
target distribution and do not account for asymptotic bias
This talk: Introduce new quality measures suitable for comparing approximate MCMC samples
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 3 / 26
Quality Measures for Samples
Challenge: Develop measure suitable for comparing the quality of
any two samples approximating a common target distribution
Given
Continuous target distribution P with support X = R^d and density p
p known up to normalization, integration under P is intractable
Sample points x1, . . . , xn ∈ X
Define the discrete distribution Q_n with, for any function h, E_{Q_n}[h(X)] = (1/n)∑_{i=1}^n h(x_i), used to approximate E_P[h(Z)]
We make no assumption about the provenance of the xi
Goal: Quantify how well EQn approximates EP in a manner that
I. Detects when a sample sequence is converging to the target
II. Detects when a sample sequence is not converging to the target
III. Is computationally feasible
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 4 / 26
Integral Probability Metrics
Goal: Quantify how well EQn approximates EP
Idea: Consider an integral probability metric (IPM) [Müller, 1997]
d_H(Q_n, P) = sup_{h∈H} |E_{Q_n}[h(X)] − E_P[h(Z)]|
Measures maximum discrepancy between sample and target
expectations over a class of real-valued test functions H
When H sufficiently large, convergence of dH(Qn, P) to zero
implies (Qn)n≥1 converges weakly to P (Requirement II)
Problem: Integration under P intractable!
⇒ Most IPMs cannot be computed in practice
Idea: Only consider functions with EP [h(Z)] known a priori to be 0
Then IPM computation only depends on Qn!
How do we select this class of test functions?
Will the resulting discrepancy measure track sample sequence
convergence (Requirements I and II)?
How do we solve the resulting optimization problem in practice?
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 5 / 26
Stein’s Method
Stein’s method [1972] provides a recipe for controlling convergence:
1 Identify an operator T and a set G of functions g : X → R^d with E_P[(T g)(Z)] = 0 for all g ∈ G.
T and G together define the Stein discrepancy [Gorham and Mackey, 2015]
S(Q_n, T, G) ≜ sup_{g∈G} |E_{Q_n}[(T g)(X)]| = d_{T G}(Q_n, P),
an IPM-type measure with no explicit integration under P
2 Lower bound S(Qn, T , G) by reference IPM dH(Qn, P )
⇒ S(Qn, T , G) → 0 only if (Qn)n≥1 converges to P (Req. II)
Performed once, in advance, for large classes of distributions
3 Upper bound S(Qn, T , G) by any means necessary to
demonstrate convergence to 0 (Requirement I)
Standard use: As analytical tool to prove convergence
Our goal: Develop Stein discrepancy into practical quality measure
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 6 / 26
Identifying a Stein Operator T
Goal: Identify operator T for which EP [(T g)(Z)] = 0 for all g ∈ G
Approach: Generator method of Barbour [1988, 1990], Götze [1991]
Identify a Markov process (Z_t)_{t≥0} with stationary distribution P
Under mild conditions, its infinitesimal generator
(Au)(x) = lim_{t→0} (E[u(Z_t) | Z_0 = x] − u(x))/t
satisfies E_P[(Au)(Z)] = 0
Overdamped Langevin diffusion: dZ_t = (1/2) ∇log p(Z_t) dt + dW_t
Generator: (A_P u)(x) = (1/2) ⟨∇u(x), ∇log p(x)⟩ + (1/2) ⟨∇, ∇u(x)⟩
Stein operator: (T_P g)(x) ≜ ⟨g(x), ∇log p(x)⟩ + ⟨∇, g(x)⟩
[Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]
Depends on P only through ∇log p; computable even if p cannot be normalized!
Multivariate generalization of the density method operator
(T g)(x) = g(x) (d/dx) log p(x) + g′(x) [Stein, Diaconis, Holmes, and Reinert, 2004]
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 7 / 26
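The zero-mean property above is easy to check numerically. The following sketch (an illustration added here, not part of the original slides) applies the Langevin Stein operator to the bounded test function g(x) = tanh(x), taken coordinatewise, under the standard Gaussian target P = N(0, I_d), for which the score ∇log p(z) = −z is exact, and verifies E_P[(T_P g)(Z)] ≈ 0 by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000
Z = rng.standard_normal((n, d))                # draws from P = N(0, I_d)

score = -Z                                     # grad log p(z) = -z for the standard normal
g = np.tanh(Z)                                 # smooth, bounded g : R^d -> R^d (coordinatewise)
div_g = np.sum(1.0 - np.tanh(Z) ** 2, axis=1)  # divergence <grad, g(z)>

# Langevin Stein operator: (T_P g)(z) = <g(z), grad log p(z)> + <grad, g(z)>
Tg = np.sum(g * score, axis=1) + div_g
print(Tg.mean())                               # close to 0, up to Monte Carlo error
```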
Identifying a Stein Set G
Goal: Identify set G for which EP [(TP g)(Z)] = 0 for all g ∈ G
Approach: Reproducing kernels k : X × X → R
A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (∑_{i,l} c_i c_l k(z_i, z_l) ≥ 0, ∀ z_i ∈ X, c_i ∈ R)
e.g., Gaussian kernel k(x, y) = e^{−(1/2)‖x−y‖₂²}
Yields function space K_k = {f : f(x) = ∑_{i=1}^m c_i k(z_i, x), m ∈ N} with norm ‖f‖_{K_k} = (∑_{i,l=1}^m c_i c_l k(z_i, z_l))^{1/2}
We define the kernel Stein set of vector-valued g : X → R^d as
G_{k,‖·‖} ≜ {g = (g_1, . . . , g_d) | ‖v‖_* ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}.
Each g_j belongs to the reproducing kernel Hilbert space (RKHS) K_k
Component norms v_j ≜ ‖g_j‖_{K_k} are jointly bounded by 1
E_P[(T_P g)(Z)] = 0 for all g ∈ G_{k,‖·‖} whenever k ∈ C_b^{(1,1)} and ∇log p integrable under P [Gorham and Mackey, 2017]
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 8 / 26
Computing the Kernel Stein Discrepancy
Kernel Stein discrepancy (KSD) S(Q_n, T_P, G_{k,‖·‖})
Stein operator (T_P g)(x) ≜ ⟨g(x), ∇log p(x)⟩ + ⟨∇, g(x)⟩
Stein set G_{k,‖·‖} ≜ {g = (g_1, . . . , g_d) | ‖v‖_* ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}
Benefit: Computable in closed form [Gorham and Mackey, 2017]
S(Q_n, T_P, G_{k,‖·‖}) = ‖w‖ for w_j ≜ (∑_{i,i'=1}^n k_0^j(x_i, x_{i'}))^{1/2}.
Reduces to parallelizable pairwise evaluations of Stein kernels
k_0^j(x, y) ≜ (1 / (p(x)p(y))) ∇_{x_j} ∇_{y_j} (p(x) k(x, y) p(y))
Stein set choice inspired by the control functional kernels k_0 = ∑_{j=1}^d k_0^j of Oates, Girolami, and Chopin [2016]
When ‖·‖ = ‖·‖₂, recovers the KSD of Chwialkowski, Strathmann, and Gretton [2016], Liu, Lee, and Jordan [2016]
To ease notation, will use G_k ≜ G_{k,‖·‖₂} in the remainder of the talk
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 9 / 26
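To make the closed form concrete, here is a minimal NumPy sketch (added for illustration; the function names and structure are mine, not from the talk) of the KSD for the ‖·‖₂ Stein set. It uses the equivalent expansion of the Stein kernel, k₀(x, y) = ⟨s(x), s(y)⟩ k(x, y) + ⟨s(x), ∇_y k(x, y)⟩ + ⟨s(y), ∇_x k(x, y)⟩ + tr(∇_x ∇_y k(x, y)) with score s = ∇log p, specialized to the IMQ base kernel k(x, y) = (c² + ‖x − y‖₂²)^β, and normalizes by 1/n² to match the equally weighted Q_n.

```python
import numpy as np

def imq_stein_kernel(X, score_X, beta=-0.5, c=1.0):
    """Pairwise Stein kernel matrix k0(x_i, x_j) for the IMQ base kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta and the Langevin Stein operator."""
    n, d = X.shape
    diffs = X[:, None, :] - X[None, :, :]             # (n, n, d): x_i - x_j
    r2 = np.sum(diffs ** 2, axis=-1)                  # squared distances
    base = c ** 2 + r2
    k = base ** beta                                  # k(x_i, x_j)
    grad_x = 2 * beta * diffs * base[..., None] ** (beta - 1)  # grad_x k; grad_y k = -grad_x k
    # trace of the mixed second-derivative matrix grad_x grad_y k
    trace = -2 * beta * (d * base ** (beta - 1)
                         + 2 * (beta - 1) * r2 * base ** (beta - 2))
    s_i = score_X[:, None, :]                         # score at x_i
    s_j = score_X[None, :, :]                         # score at x_j
    return (np.sum(s_i * s_j, -1) * k                 # <s(x_i), s(x_j)> k
            + np.sum(s_j * grad_x, -1)                # <s(x_j), grad_x k>
            - np.sum(s_i * grad_x, -1)                # <s(x_i), grad_y k>
            + trace)

def ksd(X, score_X, beta=-0.5, c=1.0):
    """IMQ kernel Stein discrepancy of the empirical distribution of the rows of X."""
    k0 = imq_stein_kernel(X, score_X, beta, c)
    return np.sqrt(max(k0.sum(), 0.0)) / X.shape[0]
```

For a standard Gaussian target the score is simply −X, so `ksd(X, -X)` quantifies how well the empirical distribution of the rows of X approximates N(0, I_d).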
Detecting Non-convergence
Goal: Show S(Qn, TP , Gk) → 0 only if (Qn)n≥1 converges to P
Theorem (Univariate KSD detects non-convergence [Gorham and Mackey, 2017])
Suppose P ∈ P and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If d = 1, then S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges weakly to P.
P is the set of targets P with Lipschitz ∇log p and distant strong log concavity (⟨∇log(p(x)/p(y)), y − x⟩ / ‖x − y‖₂² ≥ k for ‖x − y‖₂ ≥ r)
Includes Gaussian mixtures with common covariance, Bayesian logistic and Student's t regression with Gaussian priors, ...
Justifies use of the KSD with Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 10 / 26
Detecting Non-convergence
Goal: Show S(Qn, TP , Gk) → 0 only if Qn converges to P
In higher dimensions, KSDs based on common kernels fail to
detect non-convergence, even for Gaussian targets P
Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017])
Suppose d ≥ 3, P = N(0, I_d), and α ≜ (1/2 − 1/d)^{−1}. If k(x, y) and its derivatives decay at an o(‖x − y‖₂^{−α}) rate as ‖x − y‖₂ → ∞, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P.
Gaussian (k(x, y) = e^{−(1/2)‖x−y‖₂²}) and Matérn kernels fail for d ≥ 3
Inverse multiquadric kernels (k(x, y) = (1 + ‖x − y‖₂²)^β) with β < −1 fail for d > 2β/(1+β)
The violating sample sequences (Qn)n≥1 are simple to construct
Problem: Kernels with light tails ignore excess mass in the tails
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 11 / 26
The Importance of Tightness
Goal: Show S(Qn, TP , Gk) → 0 only if Qn converges to P
A sequence (Q_n)_{n≥1} is uniformly tight if for every ε > 0, there is a finite number R(ε) such that sup_n Q_n(‖X‖₂ > R(ε)) ≤ ε
Intuitively, no mass in the sequence escapes to infinity
Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017])
Suppose that P ∈ P and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If (Q_n)_{n≥1} is uniformly tight and S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.
Good news, but, ideally, KSD would detect non-tight sequences
automatically...
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 12 / 26
Detecting Non-convergence
Goal: Show S(Qn, TP , Gk) → 0 only if Qn converges to P
Consider the inverse multiquadric (IMQ) kernel k(x, y) = (c² + ‖x − y‖₂²)^β for some β < 0, c ∈ R.
IMQ KSD fails to detect non-convergence when β < −1
However, IMQ KSD detects non-convergence when β ∈ (−1, 0)
Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017])
Suppose P ∈ P and k(x, y) = (c² + ‖x − y‖₂²)^β for β ∈ (−1, 0). If S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 13 / 26
Detecting Convergence
Goal: Show S(Qn, TP , Gk) → 0 when Qn converges to P
Proposition (KSD detects convergence [Gorham and Mackey, 2017])
If k ∈ C_b^{(2,2)} and ∇log p Lipschitz and square integrable under P, then S(Q_n, T_P, G_k) → 0 whenever the Wasserstein distance d_{W,‖·‖₂}(Q_n, P) → 0.
Covers Gaussian, Matérn, IMQ, and other common bounded kernels k
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 14 / 26
A Simple Example
[Figure, left: discrepancy value vs. number of sample points n for i.i.d. samples from the mixture target P and from a single mixture component; right: recovered Stein functions g and test functions h = T_P g over x ∈ [−3, 3]]
Left plot:
For target p(x) ∝ e^{−(x+1.5)²/2} + e^{−(x−1.5)²/2}, compare an i.i.d. sample Q_n from P and an i.i.d. sample Q'_n from a single component
Expect S(Q_{1:n}, T_P, G_k) → 0 and S(Q'_{1:n}, T_P, G_k) ↛ 0
Compare IMQ KSD (β = −1/2, c = 1) with the Wasserstein distance
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 15 / 26
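As a concrete illustration (not part of the slides), this experiment can be reproduced in a few lines with the `ksd` sketch given earlier, which is assumed to be in scope; the only target-specific ingredient is the score ∇log p of the two-component mixture, written out below.

```python
import numpy as np

def mixture_score(x):
    # grad log p for p(x) ∝ exp(-(x+1.5)^2/2) + exp(-(x-1.5)^2/2)
    a = np.exp(-0.5 * (x + 1.5) ** 2)
    b = np.exp(-0.5 * (x - 1.5) ** 2)
    return (-(x + 1.5) * a - (x - 1.5) * b) / (a + b)

rng = np.random.default_rng(1)
n = 1000
# i.i.d. draws from the mixture target vs. from a single component
x_target = rng.choice([-1.5, 1.5], size=n) + rng.standard_normal(n)
x_single = -1.5 + rng.standard_normal(n)

for x in (x_target, x_single):
    X = x.reshape(-1, 1)
    S = ksd(X, mixture_score(x).reshape(-1, 1))   # ksd() from the earlier sketch
    print(S)  # small for the target sample, bounded away from 0 for the single component
```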
A Simple Example
[Figure: same mixture experiment; left: discrepancy value vs. n; right: optimal Stein functions g (top) and test functions h = T_P g (bottom) over x ∈ [−3, 3]]
Right plot: For n = 10³ sample points,
(Top) Recovered optimal Stein functions g ∈ G_k
(Bottom) Associated test functions h ≜ T_P g which best discriminate the sample Q_n from the target P
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 16 / 26
The Importance of Kernel Choice
[Figure: kernel Stein discrepancy vs. number of sample points n for Gaussian, Matérn, and inverse multiquadric kernels, comparing i.i.d. samples from the target P with off-target samples, in dimensions d = 5, 8, 20]
Target P = N(0, I_d)
Off-target Q_n has all ‖x_i‖₂ ≤ 2 n^{1/d} log n and ‖x_i − x_j‖₂ ≥ 2 log n
Gaussian and Matérn KSDs driven to 0 by an off-target sequence that does not converge to P
IMQ KSD (β = −1/2, c = 1) does not have this deficiency
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 17 / 26
Selecting Sampler Hyperparameters
[Figure, top: ESS (higher is better) and IMQ KSD (lower is better) as a function of the tolerance parameter ε ∈ {0, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}; bottom: sample scatter plots in (x1, x2) for ε = 0 (n = 230), ε = 10⁻² (n = 416), ε = 10⁻¹ (n = 1000)]
Approximate slice sampling [DuBois, Korattikara, Welling, and Smyth, 2014]
Tolerance parameter ε controls the number of datapoints evaluated
ε too small ⇒ too few sample points generated
ε too large ⇒ sampling from a very different distribution
Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias
ESS maximized at tolerance ε = 10⁻¹; IMQ KSD minimized at tolerance ε = 10⁻²
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 18 / 26
Selecting Samplers
Stochastic Gradient Fisher Scoring (SGFS)
[Ahn, Korattikara, and Welling, 2012]
Approximate MCMC procedure designed for scalability
Approximates Metropolis-adjusted Langevin algorithm but does
not use Metropolis-Hastings correction
Target P is not stationary distribution
Goal: Choose between two variants
SGFS-f inverts a d × d matrix for each new sample point
SGFS-d inverts a diagonal matrix to reduce sampling time
MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]
10000 images, 51 features, binary label indicating whether the image shows a 7 or a 9
Bayesian logistic regression posterior P
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 19 / 26
Selecting Samplers
[Figure, left: IMQ kernel Stein discrepancy vs. number of sample points n for SGFS-d and SGFS-f; right: SGFS-d and SGFS-f bivariate marginals, e.g. (x7, x51), (x8, x42), (x32, x34), (x2, x25), that align best and worst with the surrogate ground truth]
Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)
Right: SGFS sample points (n = 5 × 10⁴) with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with the surrogate ground truth sample (red)
Both suggest small speed-up of SGFS-d (0.0017s per sample vs.
0.0019s for SGFS-f) outweighed by loss in inferential accuracy
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 20 / 26
Beyond Sample Quality Comparison
One-sample hypothesis testing
Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Qn, TP , Gk)
to test whether a sample was drawn from a target distribution P
(see also Liu, Lee, and Jordan [2016])
Test with default Gaussian kernel k experienced considerable loss
of power as the dimension d increased
We recreate their experiment with the IMQ kernel (β = −1/2, c = 1)
For n = 500, generate a sample (x_i)_{i=1}^n with x_i = z_i + u_i e_1, z_i iid∼ N(0, I_d) and u_i iid∼ Unif[0, 1]. Target P = N(0, I_d).
Compare with standard normality test of Baringhaus and Henze [1988]
Table: Mean power of multivariate normality tests across 400 simulations
d=2 d=5 d=10 d=15 d=20 d=25
B&H 1.0 1.0 1.0 0.91 0.57 0.26
Gaussian 1.0 1.0 0.88 0.29 0.12 0.02
IMQ 1.0 1.0 1.0 1.0 1.0 1.0
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 21 / 26
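For reference, a minimal sketch (my own, not from the slides) of the off-target sample used in this power study and its IMQ KSD under the hypothesized target P = N(0, I_d); it reuses the `ksd` helper sketched earlier. The actual test of Chwialkowski et al. additionally calibrates a rejection threshold with a wild bootstrap, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 15
Z = rng.standard_normal((n, d))            # an exact draw from P = N(0, I_d)
X = Z.copy()
X[:, 0] += rng.uniform(0.0, 1.0, size=n)   # x_i = z_i + u_i e_1: perturb the first coordinate

# Score of the hypothesized target P = N(0, I_d) is -x
print(ksd(X, -X, beta=-0.5, c=1.0))        # inflated by the off-target perturbation
print(ksd(Z, -Z, beta=-0.5, c=1.0))        # baseline for a true sample from P
```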
Beyond Sample Quality Comparison
Improving sample quality
Given sample points (x_i)_{i=1}^n, can minimize the KSD S(Q̃_n, T_P, G_k) over all weighted samples Q̃_n = ∑_{i=1}^n q_n(x_i) δ_{x_i} for q_n a probability mass function
Liu and Lee [2016] do this with the Gaussian kernel k(x, y) = e^{−(1/h)‖x−y‖₂²}
Bandwidth h set to the median of the squared Euclidean distance between pairs of sample points
We recreate their experiment with the IMQ kernel k(x, y) = (1 + (1/h)‖x − y‖₂²)^{−1/2}
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 22 / 26
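A minimal sketch of this reweighting step (an illustration under my own naming, not the solver used by Liu and Lee): with a precomputed Stein kernel matrix K₀, e.g. from the `imq_stein_kernel` sketch earlier, the squared KSD of a weighted sample is qᵀK₀q, so the optimal weights solve a quadratic program over the probability simplex. For the n = 100 points used here an off-the-shelf SLSQP solve is adequate.

```python
import numpy as np
from scipy.optimize import minimize

def ksd_weights(k0):
    """Probability weights q minimizing the squared KSD q^T K0 q over the simplex."""
    n = k0.shape[0]
    res = minimize(
        fun=lambda q: q @ k0 @ q,
        jac=lambda q: 2.0 * (k0 @ q),
        x0=np.full(n, 1.0 / n),                 # start from the unweighted sample
        bounds=[(0.0, 1.0)] * n,                # q_i >= 0
        constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}],  # sum to 1
        method="SLSQP",
    )
    return res.x

# Usage: weights = ksd_weights(imq_stein_kernel(X, score_X)); then
# E_P[h(Z)] is approximated by weights @ h(X) instead of h(X).mean(0).
```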
Improving Sample Quality
[Figure: average MSE ‖E_P[Z] − E_{Q̃_n}[X]‖₂² / d vs. dimension d ∈ {2, 10, 50, 75, 100} for the initial sample Q_n, Gaussian KSD weights, and IMQ KSD weights]
MSE averaged over 500 simulations (±2 standard errors)
Target P = N(0, I_d)
Starting sample Q_n = (1/n)∑_{i=1}^n δ_{x_i} for x_i iid∼ P, n = 100.
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 23 / 26
Future Directions
Many opportunities for future development
1 Improve scalability of KSD while maintaining convergence
determining properties
Low-rank, sparse, or stochastic approximations of kernel matrix
Subsampling of likelihood terms in ∇log p
2 Addressing other inferential tasks
Design of control variates [Oates, Girolami, and Chopin, 2016]
Variational inference [Liu and Wang, 2016, Liu and Feng, 2016]
Training generative adversarial networks [Wang and Liu, 2016] and
variational autoencoders [Pu, Gan, Henao, Li, Han, and Carin, 2017]
3 Exploring the impact of Stein operator choice
An infinite number of operators T characterize P
How is discrepancy impacted? How do we select the best T ?
Thm: If ∇log p bounded and k ∈ C_0^{(1,1)}, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P
Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2016] may be appropriate for heavy tails
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 24 / 26
References I
S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML,
ICML’12, 2012.
A. D. Barbour. Stein’s method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. ISSN
0021-9002. A celebration of applied probability.
A. D. Barbour. Stein’s method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990. ISSN
0178-8051. doi: 10.1007/BF01197887.
L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function.
Metrika, 35(1):339–348, 1988.
K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, ICML, 2016.
C. DuBois, A. Korattikara, M. Welling, and P. Smyth. Approximate slice sampling for Bayesian posterior inference. In Proc.
17th AISTATS, pages 185–193, 2014.
J. Gorham and L. Mackey. Measuring sample quality with Stein’s method. In C. Cortes, N. D. Lawrence, D. D. Lee,
M. Sugiyama, and R. Garnett, editors, Adv. NIPS 28, pages 226–234. Curran Associates, Inc., 2015.
J. Gorham and L. Mackey. Measuring sample quality with kernels. arXiv:1703.01717, Mar. 2017.
J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.
F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.
A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. of 31st
ICML, ICML’14, 2014.
Q. Liu and Y. Feng. Two methods for wild variational inference. arXiv preprint arXiv:1612.00081, 2016.
Q. Liu and J. Lee. Black-box importance sampling. arXiv:1610.05247, Oct. 2016. To appear in AISTATS 2017.
Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. arXiv:1608.04471,
Aug. 2016.
Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. of 33rd ICML, volume 48 of
ICML, pages 276–284, 2016.
L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv:1512.07392, 2015.
A. Müller. Integral probability metrics and their generating classes of functions. Ann. Appl. Probab., 29(2):429–443, 1997.
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 25 / 26
References II
C. Oates and M. Girolami. Control functionals for quasi-Monte Carlo integration. In A. Gretton and C. C. Robert, editors,
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of
Machine Learning Research, pages 56–65, Cadiz, Spain, 09–11 May 2016. PMLR.
C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), pages n/a–n/a, 2016. ISSN 1467-9868. doi: 10.1111/rssb.12185.
Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In Advances in Neural
Information Processing Systems, pages 4237–4246, 2017.
C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In
Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971),
Vol. II: Probability theory, pages 583–602. Univ. California Press, Berkeley, Calif., 1972.
C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein’s method:
expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1–26. Inst. Math. Statist.,
Beachwood, OH, 2004.
D. Wang and Q. Liu. Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning.
arXiv:1611.01722, Nov. 2016.
M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 26 / 26
Comparing Discrepancies
[Figure, left: discrepancy value (IMQ KSD, graph Stein discrepancy, Wasserstein distance) vs. number of sample points n for i.i.d. samples from the mixture target P and from a single mixture component; right: discrepancy computation time (sec) vs. n for d = 1 and d = 4]
Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ e^{−(x+1.5)²/2} + e^{−(x−1.5)²/2} or a single mixture component.
Right: Discrepancy computation time using d cores in d dimensions.
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 27 / 26
Selecting Sampler Hyperparameters
Setup [Welling and Teh, 2011]
Consider the posterior distribution P induced by L datapoints yl
drawn i.i.d. from a Gaussian mixture likelihood
Y_l | X iid∼ (1/2) N(X_1, 2) + (1/2) N(X_1 + X_2, 2)
under Gaussian priors on the parameters X ∈ R²
X_1 ∼ N(0, 10) ⊥⊥ X_2 ∼ N(0, 1)
Draw m = 100 datapoints y_l with parameters (x_1, x_2) = (0, 1)
Induces posterior with second mode at (x_1, x_2) = (1, −1)
For a range of parameters ε, run approximate slice sampling for 148000 datapoint likelihood evaluations and store the resulting posterior sample Q_n
Use minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε
Compare with standard MCMC parameter selection criterion,
effective sample size (ESS), a measure of Markov chain
autocorrelation
Compute median of diagnostic over 50 random sequences
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 28 / 26
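Since the KSD needs only ∇log p evaluated at the sample points, a sketch of the posterior score for this model may be useful (my own derivation and naming, not from the slides; the Gaussian normalizing constants cancel because both components share variance σ² = 2):

```python
import numpy as np

def mixture_posterior_score(theta, y, sigma2=2.0):
    """grad_theta log p(theta | y) for the Welling-Teh mixture model.

    theta = (x1, x2); y: array of observations;
    likelihood 0.5*N(y; x1, sigma2) + 0.5*N(y; x1 + x2, sigma2);
    priors x1 ~ N(0, 10), x2 ~ N(0, 1)."""
    x1, x2 = theta
    phi1 = np.exp(-0.5 * (y - x1) ** 2 / sigma2)        # unnormalized component densities
    phi2 = np.exp(-0.5 * (y - x1 - x2) ** 2 / sigma2)
    mix = 0.5 * phi1 + 0.5 * phi2
    d1 = 0.5 * (phi1 * (y - x1) + phi2 * (y - x1 - x2)) / (sigma2 * mix)
    d2 = 0.5 * phi2 * (y - x1 - x2) / (sigma2 * mix)
    return np.array([d1.sum() - x1 / 10.0,              # likelihood term + N(0, 10) prior
                     d2.sum() - x2])                     # likelihood term + N(0, 1) prior
```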
Selecting Samplers
Setup
MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]
10000 images, 51 features, binary label indicating whether the image shows a 7 or a 9
Bayesian logistic regression posterior P
L independent observations (y_l, v_l) ∈ {1, −1} × R^d with P(Y_l = 1 | v_l, X) = 1/(1 + exp(−⟨v_l, X⟩))
Flat improper prior on the parameters X ∈ Rd
Use the IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10⁵ sample points and discarding the first half as burn-in
For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain with 10⁵ sample points [Ahn, Korattikara, and Welling, 2012]
Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 29 / 26
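For completeness, a sketch of the only target-dependent quantity the KSD needs here: the posterior score ∇_X log p(X | data) for flat-prior Bayesian logistic regression (illustrative naming, my own code; under the flat prior the score is just the log-likelihood gradient):

```python
import numpy as np

def logistic_posterior_score(theta, V, y):
    """grad_theta log p(theta | data) for Bayesian logistic regression, flat prior.

    V: (L, d) feature matrix; y: (L,) labels in {+1, -1}; theta: (d,) parameter point.
    log-likelihood: sum_l log sigmoid(y_l <v_l, theta>)."""
    margins = y * (V @ theta)                    # y_l <v_l, theta>
    weights = 1.0 / (1.0 + np.exp(margins))      # sigmoid(-margins) = d/dm log sigmoid(m)
    return V.T @ (y * weights)

# In the KSD, this score is evaluated at every sample point x_i produced by SGFS.
```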

More Related Content

What's hot

Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
 
ABC with Wasserstein distances
ABC with Wasserstein distancesABC with Wasserstein distances
ABC with Wasserstein distancesChristian Robert
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big DataChristian Robert
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017Fred J. Hickernell
 
Can we estimate a constant?
Can we estimate a constant?Can we estimate a constant?
Can we estimate a constant?Christian Robert
 
accurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannaccurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannolli0601
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Valentin De Bortoli
 
comments on exponential ergodicity of the bouncy particle sampler
comments on exponential ergodicity of the bouncy particle samplercomments on exponential ergodicity of the bouncy particle sampler
comments on exponential ergodicity of the bouncy particle samplerChristian Robert
 
Probabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance ConstraintsProbabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance ConstraintsLeo Asselborn
 
prior selection for mixture estimation
prior selection for mixture estimationprior selection for mixture estimation
prior selection for mixture estimationChristian Robert
 

What's hot (20)

Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithms
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
ABC with Wasserstein distances
ABC with Wasserstein distancesABC with Wasserstein distances
ABC with Wasserstein distances
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017
 
Can we estimate a constant?
Can we estimate a constant?Can we estimate a constant?
Can we estimate a constant?
 
ABC in Venezia
ABC in VeneziaABC in Venezia
ABC in Venezia
 
accurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannaccurate ABC Oliver Ratmann
accurate ABC Oliver Ratmann
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
comments on exponential ergodicity of the bouncy particle sampler
comments on exponential ergodicity of the bouncy particle samplercomments on exponential ergodicity of the bouncy particle sampler
comments on exponential ergodicity of the bouncy particle sampler
 
Probabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance ConstraintsProbabilistic Control of Switched Linear Systems with Chance Constraints
Probabilistic Control of Switched Linear Systems with Chance Constraints
 
prior selection for mixture estimation
prior selection for mixture estimationprior selection for mixture estimation
prior selection for mixture estimation
 

Similar to QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Measuring Sample Quality with Kernels - Lester Mackey, Dec 12, 2017

Divergence clustering
Divergence clusteringDivergence clustering
Divergence clusteringFrank Nielsen
 
Matrix Completion Presentation
Matrix Completion PresentationMatrix Completion Presentation
Matrix Completion PresentationMichael Hankin
 
Litvinenko low-rank kriging +FFT poster
Litvinenko low-rank kriging +FFT  posterLitvinenko low-rank kriging +FFT  poster
Litvinenko low-rank kriging +FFT posterAlexander Litvinenko
 
SPDE presentation 2012
SPDE presentation 2012SPDE presentation 2012
SPDE presentation 2012Zheng Mengdi
 
Divergence center-based clustering and their applications
Divergence center-based clustering and their applicationsDivergence center-based clustering and their applications
Divergence center-based clustering and their applicationsFrank Nielsen
 
Accelerating Pseudo-Marginal MCMC using Gaussian Processes
Accelerating Pseudo-Marginal MCMC using Gaussian ProcessesAccelerating Pseudo-Marginal MCMC using Gaussian Processes
Accelerating Pseudo-Marginal MCMC using Gaussian ProcessesMatt Moores
 
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMSSOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMSTahia ZERIZER
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Sangwoo Mo
 
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Frank Nielsen
 
Numerical Evidence for Darmon Points
Numerical Evidence for Darmon PointsNumerical Evidence for Darmon Points
Numerical Evidence for Darmon Pointsmmasdeu
 
(α ψ)- Construction with q- function for coupled fixed point
(α   ψ)-  Construction with q- function for coupled fixed point(α   ψ)-  Construction with q- function for coupled fixed point
(α ψ)- Construction with q- function for coupled fixed pointAlexander Decker
 
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 On Clustering Histograms with k-Means by Using Mixed α-Divergences On Clustering Histograms with k-Means by Using Mixed α-Divergences
On Clustering Histograms with k-Means by Using Mixed α-DivergencesFrank Nielsen
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixturesChristian Robert
 
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...Chiheb Ben Hammouda
 
A common fixed point theorem in cone metric spaces
A common fixed point theorem in cone metric spacesA common fixed point theorem in cone metric spaces
A common fixed point theorem in cone metric spacesAlexander Decker
 
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: MixturesCVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtureszukun
 

Similar to QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Measuring Sample Quality with Kernels - Lester Mackey, Dec 12, 2017 (20)

Divergence clustering
Divergence clusteringDivergence clustering
Divergence clustering
 
Matrix Completion Presentation
Matrix Completion PresentationMatrix Completion Presentation
Matrix Completion Presentation
 
Litvinenko low-rank kriging +FFT poster
Litvinenko low-rank kriging +FFT  posterLitvinenko low-rank kriging +FFT  poster
Litvinenko low-rank kriging +FFT poster
 
SPDE presentation 2012
SPDE presentation 2012SPDE presentation 2012
SPDE presentation 2012
 
The Gaussian Hardy-Littlewood Maximal Function
The Gaussian Hardy-Littlewood Maximal FunctionThe Gaussian Hardy-Littlewood Maximal Function
The Gaussian Hardy-Littlewood Maximal Function
 
Divergence center-based clustering and their applications
Divergence center-based clustering and their applicationsDivergence center-based clustering and their applications
Divergence center-based clustering and their applications
 
Accelerating Pseudo-Marginal MCMC using Gaussian Processes
Accelerating Pseudo-Marginal MCMC using Gaussian ProcessesAccelerating Pseudo-Marginal MCMC using Gaussian Processes
Accelerating Pseudo-Marginal MCMC using Gaussian Processes
 
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMSSOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)
 
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
 
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
 
Numerical Evidence for Darmon Points
Numerical Evidence for Darmon PointsNumerical Evidence for Darmon Points
Numerical Evidence for Darmon Points
 
talk MCMC & SMC 2004
talk MCMC & SMC 2004talk MCMC & SMC 2004
talk MCMC & SMC 2004
 
(α ψ)- Construction with q- function for coupled fixed point
(α   ψ)-  Construction with q- function for coupled fixed point(α   ψ)-  Construction with q- function for coupled fixed point
(α ψ)- Construction with q- function for coupled fixed point
 
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 On Clustering Histograms with k-Means by Using Mixed α-Divergences On Clustering Histograms with k-Means by Using Mixed α-Divergences
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 
Vancouver18
Vancouver18Vancouver18
Vancouver18
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
 
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
 
A common fixed point theorem in cone metric spaces
A common fixed point theorem in cone metric spacesA common fixed point theorem in cone metric spaces
A common fixed point theorem in cone metric spaces
 
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: MixturesCVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
 

More from The Statistical and Applied Mathematical Sciences Institute

More from The Statistical and Applied Mathematical Sciences Institute (20)

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 

Recently uploaded

Arti Languages Pre Seed Send Ahead Pitchdeck 2024.pdf
Arti Languages Pre Seed Send Ahead Pitchdeck 2024.pdfArti Languages Pre Seed Send Ahead Pitchdeck 2024.pdf
Arti Languages Pre Seed Send Ahead Pitchdeck 2024.pdfwill854175
 
25 CHUYÊN ĐỀ ÔN THI TỐT NGHIỆP THPT 2023 – BÀI TẬP PHÁT TRIỂN TỪ ĐỀ MINH HỌA...
25 CHUYÊN ĐỀ ÔN THI TỐT NGHIỆP THPT 2023 – BÀI TẬP PHÁT TRIỂN TỪ ĐỀ MINH HỌA...25 CHUYÊN ĐỀ ÔN THI TỐT NGHIỆP THPT 2023 – BÀI TẬP PHÁT TRIỂN TỪ ĐỀ MINH HỌA...
25 CHUYÊN ĐỀ ÔN THI TỐT NGHIỆP THPT 2023 – BÀI TẬP PHÁT TRIỂN TỪ ĐỀ MINH HỌA...Nguyen Thanh Tu Collection
 
30-de-thi-vao-lop-10-mon-tieng-anh-co-dap-an.doc
30-de-thi-vao-lop-10-mon-tieng-anh-co-dap-an.doc30-de-thi-vao-lop-10-mon-tieng-anh-co-dap-an.doc
30-de-thi-vao-lop-10-mon-tieng-anh-co-dap-an.docdieu18
 
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptx
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptxAUDIENCE THEORY - PARTICIPATORY - JENKINS.pptx
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptxiammrhaywood
 
2024.03.16 How to write better quality materials for your learners ELTABB San...
2024.03.16 How to write better quality materials for your learners ELTABB San...2024.03.16 How to write better quality materials for your learners ELTABB San...
2024.03.16 How to write better quality materials for your learners ELTABB San...Sandy Millin
 
The OERs: Transforming Education for Sustainable Future by Dr. Sarita Anand
The OERs: Transforming Education for Sustainable Future by Dr. Sarita AnandThe OERs: Transforming Education for Sustainable Future by Dr. Sarita Anand
The OERs: Transforming Education for Sustainable Future by Dr. Sarita AnandDr. Sarita Anand
 
LEAD6001 - Introduction to Advanced Stud
LEAD6001 - Introduction to Advanced StudLEAD6001 - Introduction to Advanced Stud
LEAD6001 - Introduction to Advanced StudDr. Bruce A. Johnson
 
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYS
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYSDLL Catch Up Friday March 22.docx CATCH UP FRIDAYS
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYSTeacherNicaPrintable
 
Awards Presentation 2024 - March 12 2024
Awards Presentation 2024 - March 12 2024Awards Presentation 2024 - March 12 2024
Awards Presentation 2024 - March 12 2024bsellato
 
Dhavni Theory by Anandvardhana Indian Poetics
Dhavni Theory by Anandvardhana Indian PoeticsDhavni Theory by Anandvardhana Indian Poetics
Dhavni Theory by Anandvardhana Indian PoeticsDhatriParmar
 
The basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptxThe basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptxheathfieldcps1
 
Metabolism , Metabolic Fate& disorders of cholesterol.pptx
Metabolism , Metabolic Fate& disorders of cholesterol.pptxMetabolism , Metabolic Fate& disorders of cholesterol.pptx
Metabolism , Metabolic Fate& disorders of cholesterol.pptxDr. Santhosh Kumar. N
 
3.14.24 Gender Discrimination and Gender Inequity.pptx
3.14.24 Gender Discrimination and Gender Inequity.pptx3.14.24 Gender Discrimination and Gender Inequity.pptx
3.14.24 Gender Discrimination and Gender Inequity.pptxmary850239
 
3.12.24 Freedom Summer in Mississippi.pptx
3.12.24 Freedom Summer in Mississippi.pptx3.12.24 Freedom Summer in Mississippi.pptx
3.12.24 Freedom Summer in Mississippi.pptxmary850239
 
Research Methodology and Tips on Better Research
Research Methodology and Tips on Better ResearchResearch Methodology and Tips on Better Research
Research Methodology and Tips on Better ResearchRushdi Shams
 
3.14.24 The Selma March and the Voting Rights Act.pptx
3.14.24 The Selma March and the Voting Rights Act.pptx3.14.24 The Selma March and the Voting Rights Act.pptx
3.14.24 The Selma March and the Voting Rights Act.pptxmary850239
 
PHARMACOGNOSY CHAPTER NO 5 CARMINATIVES AND G.pdf
PHARMACOGNOSY CHAPTER NO 5 CARMINATIVES AND G.pdfPHARMACOGNOSY CHAPTER NO 5 CARMINATIVES AND G.pdf
PHARMACOGNOSY CHAPTER NO 5 CARMINATIVES AND G.pdfSumit Tiwari
 
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...gdgsurrey
 

Recently uploaded (20)

t-test Parametric test Biostatics and Research Methodology
t-test Parametric test Biostatics and Research Methodologyt-test Parametric test Biostatics and Research Methodology
t-test Parametric test Biostatics and Research Methodology
 
Arti Languages Pre Seed Send Ahead Pitchdeck 2024.pdf
Arti Languages Pre Seed Send Ahead Pitchdeck 2024.pdfArti Languages Pre Seed Send Ahead Pitchdeck 2024.pdf
Arti Languages Pre Seed Send Ahead Pitchdeck 2024.pdf
 
25 CHUYÊN ĐỀ ÔN THI TỐT NGHIỆP THPT 2023 – BÀI TẬP PHÁT TRIỂN TỪ ĐỀ MINH HỌA...
25 CHUYÊN ĐỀ ÔN THI TỐT NGHIỆP THPT 2023 – BÀI TẬP PHÁT TRIỂN TỪ ĐỀ MINH HỌA...25 CHUYÊN ĐỀ ÔN THI TỐT NGHIỆP THPT 2023 – BÀI TẬP PHÁT TRIỂN TỪ ĐỀ MINH HỌA...
25 CHUYÊN ĐỀ ÔN THI TỐT NGHIỆP THPT 2023 – BÀI TẬP PHÁT TRIỂN TỪ ĐỀ MINH HỌA...
 
30-de-thi-vao-lop-10-mon-tieng-anh-co-dap-an.doc
30-de-thi-vao-lop-10-mon-tieng-anh-co-dap-an.doc30-de-thi-vao-lop-10-mon-tieng-anh-co-dap-an.doc
30-de-thi-vao-lop-10-mon-tieng-anh-co-dap-an.doc
 
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptx
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptxAUDIENCE THEORY - PARTICIPATORY - JENKINS.pptx
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptx
 
2024.03.16 How to write better quality materials for your learners ELTABB San...
2024.03.16 How to write better quality materials for your learners ELTABB San...2024.03.16 How to write better quality materials for your learners ELTABB San...
2024.03.16 How to write better quality materials for your learners ELTABB San...
 
The OERs: Transforming Education for Sustainable Future by Dr. Sarita Anand
The OERs: Transforming Education for Sustainable Future by Dr. Sarita AnandThe OERs: Transforming Education for Sustainable Future by Dr. Sarita Anand
The OERs: Transforming Education for Sustainable Future by Dr. Sarita Anand
 
LEAD6001 - Introduction to Advanced Stud
LEAD6001 - Introduction to Advanced StudLEAD6001 - Introduction to Advanced Stud
LEAD6001 - Introduction to Advanced Stud
 
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYS
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYSDLL Catch Up Friday March 22.docx CATCH UP FRIDAYS
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYS
 
Least Significance Difference:Biostatics and Research Methodology
Least Significance Difference:Biostatics and Research MethodologyLeast Significance Difference:Biostatics and Research Methodology
Least Significance Difference:Biostatics and Research Methodology
 
Awards Presentation 2024 - March 12 2024
Awards Presentation 2024 - March 12 2024Awards Presentation 2024 - March 12 2024
Awards Presentation 2024 - March 12 2024
 
Dhavni Theory by Anandvardhana Indian Poetics
Dhavni Theory by Anandvardhana Indian PoeticsDhavni Theory by Anandvardhana Indian Poetics
Dhavni Theory by Anandvardhana Indian Poetics
 
The basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptxThe basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptx
 
Metabolism , Metabolic Fate& disorders of cholesterol.pptx
Metabolism , Metabolic Fate& disorders of cholesterol.pptxMetabolism , Metabolic Fate& disorders of cholesterol.pptx
Metabolism , Metabolic Fate& disorders of cholesterol.pptx
 
3.14.24 Gender Discrimination and Gender Inequity.pptx
3.14.24 Gender Discrimination and Gender Inequity.pptx3.14.24 Gender Discrimination and Gender Inequity.pptx
3.14.24 Gender Discrimination and Gender Inequity.pptx
 
3.12.24 Freedom Summer in Mississippi.pptx
3.12.24 Freedom Summer in Mississippi.pptx3.12.24 Freedom Summer in Mississippi.pptx
3.12.24 Freedom Summer in Mississippi.pptx
 
Research Methodology and Tips on Better Research
Research Methodology and Tips on Better ResearchResearch Methodology and Tips on Better Research
Research Methodology and Tips on Better Research
 
3.14.24 The Selma March and the Voting Rights Act.pptx
3.14.24 The Selma March and the Voting Rights Act.pptx3.14.24 The Selma March and the Voting Rights Act.pptx
3.14.24 The Selma March and the Voting Rights Act.pptx
 
PHARMACOGNOSY CHAPTER NO 5 CARMINATIVES AND G.pdf
PHARMACOGNOSY CHAPTER NO 5 CARMINATIVES AND G.pdfPHARMACOGNOSY CHAPTER NO 5 CARMINATIVES AND G.pdf
PHARMACOGNOSY CHAPTER NO 5 CARMINATIVES AND G.pdf
 
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
 

QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Measuring Sample Quality with Kernels - Lester Mackey, Dec 12, 2017

  • 1. Measuring Sample Quality with Kernels Lester Mackey∗,‡ Joint work with Jackson Gorham†,‡ Microsoft Research∗ Opendoor† Stanford University‡ December 12, 2017 Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 1 / 26
  • 2. Motivation: Large-scale Posterior Inference Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets? MCMC Benefit: Approximates intractable posterior expectations EP [h(Z)] = X p(x)h(x)dx with asymptotically exact sample estimates EQ[h(X)] = 1 n n i=1 h(xi) Problem: Each point xi requires iterating over entire dataset! Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014] Approximate standard MCMC procedure in a manner that makes use of only a small subset of datapoints per sample Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance Introduces asymptotic bias: target distribution is not stationary Hope that for fixed amount of sampling time, variance reduction will outweigh bias introduced Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 2 / 26
  • 3. Motivation: Large-scale Posterior Inference Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014] Hope that for fixed amount of sampling time, variance reduction will outweigh bias introduced Introduces new challenges How do we compare and evaluate samples from approximate MCMC procedures? How do we select samplers and their tuning parameters? How do we quantify the bias-variance trade-off explicitly? Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias This talk: Introduce new quality measures suitable for comparing the quality of approximate MCMC samples Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 3 / 26
  • 4. Quality Measures for Samples Challenge: Develop measure suitable for comparing the quality of any two samples approximating a common target distribution Given Continuous target distribution P with support X = Rd and density p p known up to normalization, integration under P is intractable Sample points x1, . . . , xn ∈ X Define discrete distribution Qn with, for any function h, EQn [h(X)] = 1 n n i=1 h(xi) used to approximate EP [h(Z)] We make no assumption about the provenance of the xi Goal: Quantify how well EQn approximates EP in a manner that I. Detects when a sample sequence is converging to the target II. Detects when a sample sequence is not converging to the target III. Is computationally feasible Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 4 / 26
  • 5. Integral Probability Metrics Goal: Quantify how well EQn approximates EP Idea: Consider an integral probability metric (IPM) [M¨uller, 1997] dH(Qn, P) = sup h∈H |EQn [h(X)] − EP [h(Z)]| Measures maximum discrepancy between sample and target expectations over a class of real-valued test functions H When H sufficiently large, convergence of dH(Qn, P) to zero implies (Qn)n≥1 converges weakly to P (Requirement II) Problem: Integration under P intractable! ⇒ Most IPMs cannot be computed in practice Idea: Only consider functions with EP [h(Z)] known a priori to be 0 Then IPM computation only depends on Qn! How do we select this class of test functions? Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)? How do we solve the resulting optimization problem in practice? Mackey (MSR) Kernel Stein Discrepancy December 12, 2017 5 / 26
  • 6. Stein's Method
  Stein's method [1972] provides a recipe for controlling convergence:
  1. Identify an operator T and a set G of functions g : X → R^d with E_P[(T g)(Z)] = 0 for all g ∈ G. Together T and G define the Stein discrepancy [Gorham and Mackey, 2015]
       S(Q_n, T, G) ≜ sup_{g ∈ G} |E_Qn[(T g)(X)]| = d_{TG}(Q_n, P),
     an IPM-type measure with no explicit integration under P.
  2. Lower bound S(Q_n, T, G) by a reference IPM d_H(Q_n, P) ⇒ S(Q_n, T, G) → 0 only if (Q_n)_{n≥1} converges to P (Requirement II). Performed once, in advance, for large classes of distributions.
  3. Upper bound S(Q_n, T, G) by any means necessary to demonstrate convergence to 0 (Requirement I).
  Standard use: an analytical tool for proving convergence
  Our goal: develop the Stein discrepancy into a practical quality measure
  • 7. Identifying a Stein Operator T
  Goal: Identify an operator T for which E_P[(T g)(Z)] = 0 for all g ∈ G
  Approach: Generator method of Barbour [1988, 1990], Götze [1991]
    Identify a Markov process (Z_t)_{t≥0} with stationary distribution P
    Under mild conditions, its infinitesimal generator (A u)(x) = lim_{t→0} (E[u(Z_t) | Z_0 = x] − u(x)) / t satisfies E_P[(A u)(Z)] = 0
  Overdamped Langevin diffusion: dZ_t = (1/2) ∇log p(Z_t) dt + dW_t
    Generator: (A_P u)(x) = (1/2) ⟨∇u(x), ∇log p(x)⟩ + (1/2) ⟨∇, ∇u(x)⟩
    Stein operator: (T_P g)(x) ≜ ⟨g(x), ∇log p(x)⟩ + ⟨∇, g(x)⟩ [Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]
    Depends on P only through ∇log p; computable even if p cannot be normalized!
    Multivariate generalization of the density method operator (T g)(x) = g(x) (d/dx) log p(x) + g′(x) [Stein, Diaconis, Holmes, and Reinert, 2004]
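  As a concrete check of the zero-mean property above, here is a minimal NumPy sketch (my own illustration, not from the talk) that applies the Langevin Stein operator to a simple vector field g under a standard normal target, where ∇log p(x) = −x, and verifies E_P[(T_P g)(Z)] ≈ 0 by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: standard normal in d dimensions, so grad log p(x) = -x.
d = 3
grad_logp = lambda x: -x

# A smooth test function g: R^d -> R^d, here g(x) = tanh(x) coordinatewise,
# whose divergence is sum_j (1 - tanh(x_j)^2).
g = lambda x: np.tanh(x)
div_g = lambda x: np.sum(1.0 - np.tanh(x) ** 2, axis=-1)

def stein_op(x):
    """Langevin Stein operator: (T_P g)(x) = <g(x), grad log p(x)> + div g(x)."""
    return np.sum(g(x) * grad_logp(x), axis=-1) + div_g(x)

Z = rng.standard_normal((200_000, d))
print(stein_op(Z).mean())   # close to 0, illustrating E_P[(T_P g)(Z)] = 0
```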
  • 8. Identifying a Stein Set G
  Goal: Identify a set G for which E_P[(T_P g)(Z)] = 0 for all g ∈ G
  Approach: Reproducing kernels k : X × X → R
    A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (Σ_{i,l} c_i c_l k(z_i, z_l) ≥ 0 for all z_i ∈ X, c_i ∈ R), e.g., the Gaussian kernel k(x, y) = e^{−‖x−y‖²₂ / 2}
    It yields the function space K_k = {f : f(x) = Σ_{i=1}^m c_i k(z_i, x), m ∈ N} with norm ‖f‖_{K_k} = √(Σ_{i,l=1}^m c_i c_l k(z_i, z_l))
  We define the kernel Stein set of vector-valued g : X → R^d as
    G_{k,‖·‖} ≜ {g = (g_1, ..., g_d) | ‖v‖_* ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}
    Each g_j belongs to the reproducing kernel Hilbert space (RKHS) K_k
    The component norms v_j ≜ ‖g_j‖_{K_k} are jointly bounded by 1 in the dual norm ‖·‖_*
  E_P[(T_P g)(Z)] = 0 for all g ∈ G_{k,‖·‖} whenever k ∈ C^{(1,1)}_b and ∇log p is integrable under P [Gorham and Mackey, 2017]
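  A small sketch (assuming NumPy; the setup is mine, not the talk's) of the Gaussian kernel named above, with numerical checks of its two defining properties, symmetry and positive semidefiniteness, on a random point set.

```python
import numpy as np

def gaussian_kernel(X, Y):
    """k(x, y) = exp(-||x - y||_2^2 / 2), evaluated for all pairs of rows of X and Y."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq)

rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 2))
K = gaussian_kernel(Z, Z)
print(np.allclose(K, K.T))                     # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # positive semidefiniteness (up to roundoff)
```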
  • 9. Computing the Kernel Stein Discrepancy
  Kernel Stein discrepancy (KSD): S(Q_n, T_P, G_{k,‖·‖}) with
    Stein operator (T_P g)(x) ≜ ⟨g(x), ∇log p(x)⟩ + ⟨∇, g(x)⟩
    Stein set G_{k,‖·‖} ≜ {g = (g_1, ..., g_d) | ‖v‖_* ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}
  Benefit: Computable in closed form [Gorham and Mackey, 2017]:
    S(Q_n, T_P, G_{k,‖·‖}) = ‖w‖ for w_j ≜ (1/n) √(Σ_{i,i′=1}^n k_0^j(x_i, x_{i′}))
  Reduces to parallelizable pairwise evaluations of the Stein kernels
    k_0^j(x, y) ≜ (1 / (p(x) p(y))) ∂_{x_j} ∂_{y_j} (p(x) k(x, y) p(y))
  Stein set choice inspired by the control functional kernels k_0 = Σ_{j=1}^d k_0^j of Oates, Girolami, and Chopin [2016]
  When ‖·‖ = ‖·‖₂, recovers the KSD of Chwialkowski, Strathmann, and Gretton [2016] and Liu, Lee, and Jordan [2016]
  To ease notation, we write G_k ≜ G_{k,‖·‖₂} in the remainder of the talk
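  The closed form above reduces to pairwise Stein-kernel evaluations. Below is a sketch of that computation for the ℓ2 Stein set with an IMQ base kernel k(x, y) = (c² + ‖x − y‖²₂)^β, using the standard expansion of the Langevin Stein kernel in terms of k, its derivatives, and ∇log p. The function names and the exact normalization (averaging over all n² pairs) are my choices, not the paper's notation.

```python
import numpy as np

def imq_stein_kernel(X, grad_logp, c=1.0, beta=-0.5):
    """Langevin Stein kernel k0(x, y), summed over coordinates, for the IMQ base
    kernel k(x, y) = (c^2 + ||x - y||_2^2)^beta.  X: (n, d) sample points;
    grad_logp maps an (n, d) array to the (n, d) array of scores grad log p(x_i)."""
    n, d = X.shape
    G = grad_logp(X)                                  # scores at the sample points
    diffs = X[:, None, :] - X[None, :, :]             # r = x_i - x_j, shape (n, n, d)
    sq = np.sum(diffs ** 2, axis=-1)                  # ||x_i - x_j||^2
    s = c ** 2 + sq
    k = s ** beta                                     # base kernel values
    grad_x_k = 2 * beta * s[..., None] ** (beta - 1) * diffs   # grad_x k(x_i, x_j)
    grad_y_k = -grad_x_k                                       # grad_y k(x_i, x_j)
    trace_k = (-4 * beta * (beta - 1) * s ** (beta - 2) * sq
               - 2 * beta * d * s ** (beta - 1))               # sum_j d/dx_j d/dy_j k
    return (trace_k
            + np.einsum('ijk,jk->ij', grad_x_k, G)   # <grad_x k, grad log p(x_j)>
            + np.einsum('ijk,ik->ij', grad_y_k, G)   # <grad_y k, grad log p(x_i)>
            + k * (G @ G.T))                         # k * <grad log p(x_i), grad log p(x_j)>

def ksd(X, grad_logp, **kernel_args):
    """KSD of the empirical distribution of the rows of X: square root of the
    average Stein-kernel value over all n^2 pairs of sample points."""
    return np.sqrt(imq_stein_kernel(X, grad_logp, **kernel_args).mean())
```

  For a standard normal target P = N(0, I_d), for instance, `ksd(X, lambda Z: -Z)` scores the sample X, since ∇log p(x) = −x.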
  • 10. Detecting Non-convergence
  Goal: Show S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges to P
  Theorem (Univariate KSD detects non-convergence [Gorham and Mackey, 2017]). Suppose P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If d = 1, then S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges weakly to P.
    𝒫 is the set of targets P with Lipschitz ∇log p and distant strong log concavity (⟨∇log(p(x)/p(y)), y − x⟩ / ‖x − y‖²₂ ≥ k for ‖x − y‖₂ ≥ r)
    Includes Gaussian mixtures with common covariance, Bayesian logistic and Student's t regression with Gaussian priors, ...
  Justifies use of the KSD with Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case
  • 11. Detecting Non-convergence
  Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
  In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P.
  Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017]). Suppose d ≥ 3, P = N(0, I_d), and α ≜ (1/2 − 1/d)⁻¹. If k(x, y) and its derivatives decay at an o(‖x − y‖₂^{−α}) rate as ‖x − y‖₂ → ∞, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P.
    Gaussian kernels (k(x, y) = e^{−‖x−y‖²₂ / 2}) and Matérn kernels fail for d ≥ 3
    Inverse multiquadric kernels k(x, y) = (1 + ‖x − y‖²₂)^β with β < −1 fail for d > 2β / (1 + β)
    The violating sample sequences (Q_n)_{n≥1} are simple to construct
  Problem: Kernels with light tails ignore excess mass in the tails
  • 12. The Importance of Tightness
  Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
  A sequence (Q_n)_{n≥1} is uniformly tight if for every ε > 0 there is a finite number R(ε) such that sup_n Q_n(‖X‖₂ > R(ε)) ≤ ε. Intuitively, no mass in the sequence escapes to infinity.
  Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017]). Suppose that P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If (Q_n)_{n≥1} is uniformly tight and S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.
  Good news, but ideally the KSD would detect non-tight sequences automatically...
  • 13. Detecting Non-convergence
  Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
  Consider the inverse multiquadric (IMQ) kernel k(x, y) = (c² + ‖x − y‖²₂)^β for some β < 0 and c ∈ R.
    The IMQ KSD fails to detect non-convergence when β < −1
    However, the IMQ KSD detects non-convergence when β ∈ (−1, 0)
  Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017]). Suppose P ∈ 𝒫 and k(x, y) = (c² + ‖x − y‖²₂)^β for β ∈ (−1, 0). If S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.
  • 14. Detecting Convergence
  Goal: Show S(Q_n, T_P, G_k) → 0 when Q_n converges to P
  Proposition (KSD detects convergence [Gorham and Mackey, 2017]). If k ∈ C^{(2,2)}_b and ∇log p is Lipschitz and square integrable under P, then S(Q_n, T_P, G_k) → 0 whenever the Wasserstein distance d_{W,‖·‖₂}(Q_n, P) → 0.
  Covers Gaussian, Matérn, IMQ, and other common bounded kernels k
  • 15. A Simple Example
  [Plot: discrepancy value vs. number of sample points n, for an i.i.d. sample from the mixture target P and an i.i.d. sample from a single mixture component]
  Left plot: For the target p(x) ∝ e^{−(x+1.5)²/2} + e^{−(x−1.5)²/2}, compare an i.i.d. sample Q_n from P with an i.i.d. sample Q′_n from a single component
    Expect S(Q_{1:n}, T_P, G_k) → 0 and S(Q′_{1:n}, T_P, G_k) ↛ 0
  Compare the IMQ KSD (β = −1/2, c = 1) with the Wasserstein distance
  • 16. A Simple Example
  [Plot: recovered Stein functions g and associated test functions h = T_P g for n = 10³ sample points]
  Right plot: For n = 10³ sample points,
    (Top) the recovered optimal Stein functions g ∈ G_k
    (Bottom) the associated test functions h ≜ T_P g which best discriminate the sample Q_n from the target P
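  To reproduce the flavor of this example with the `ksd` sketch above (the sample size and random seeds here are my own choices), one can compare an i.i.d. draw from the mixture with an i.i.d. draw from one component:

```python
import numpy as np

def grad_logp_mixture(X):
    """Score of p(x) proportional to exp(-(x+1.5)^2/2) + exp(-(x-1.5)^2/2)."""
    a = np.exp(-0.5 * (X + 1.5) ** 2)
    b = np.exp(-0.5 * (X - 1.5) ** 2)
    return (-(X + 1.5) * a - (X - 1.5) * b) / (a + b)

rng = np.random.default_rng(2)
n = 1000
# Q_n: i.i.d. from the equal-weight mixture target
means = rng.choice([-1.5, 1.5], size=n)
Q = (means + rng.standard_normal(n)).reshape(-1, 1)
# Q'_n: i.i.d. from the single component centered at +1.5
Qp = (1.5 + rng.standard_normal(n)).reshape(-1, 1)

print(ksd(Q, grad_logp_mixture))    # small, and shrinks as n grows
print(ksd(Qp, grad_logp_mixture))   # stays bounded away from 0
```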
  • 17. The Importance of Kernel Choice
  [Plot: Gaussian, Matérn, and inverse multiquadric KSDs vs. number of sample points n, for dimensions d = 5, 8, 20]
  Target P = N(0, I_d); off-target Q_n has all ‖x_i‖₂ ≤ 2 n^{1/d} log n and ‖x_i − x_j‖₂ ≥ 2 log n
  Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to P
  The IMQ KSD (β = −1/2, c = 1) does not have this deficiency
  • 18. Selecting Sampler Hyperparameters
  [Plot: ESS (higher is better) and KSD (lower is better) vs. tolerance parameter ε; scatter plots of samples for ε = 0 (n = 230), ε = 10⁻² (n = 416), ε = 10⁻¹ (n = 1000)]
  Approximate slice sampling [DuBois, Korattikara, Welling, and Smyth, 2014]
    The tolerance parameter ε controls the number of datapoints evaluated
    ε too small ⇒ too few sample points generated; ε too large ⇒ sampling from a very different distribution
  Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias
  ESS is maximized at tolerance ε = 10⁻¹, while the IMQ KSD is minimized at ε = 10⁻² (a selection loop is sketched below)
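  A hypothetical selection loop in the spirit of this experiment. `approx_slice_sample` is a stand-in name for whatever tolerance-controlled approximate sampler is being tuned (not a real API), and `ksd` refers to the sketch given earlier.

```python
def select_tolerance(epsilons, approx_slice_sample, grad_logp, budget):
    """Run the sampler at each tolerance under a fixed likelihood-evaluation
    budget and return the tolerance with the smallest IMQ KSD."""
    scores = {}
    for eps in epsilons:
        X = approx_slice_sample(eps, budget)   # (n_eps, d) sample; n_eps grows with eps
        scores[eps] = ksd(X, grad_logp)
    best = min(scores, key=scores.get)
    return best, scores
```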
  • 19. Selecting Samplers
  Stochastic Gradient Fisher Scoring (SGFS) [Ahn, Korattikara, and Welling, 2012]
    Approximate MCMC procedure designed for scalability
    Approximates the Metropolis-adjusted Langevin algorithm but omits the Metropolis-Hastings correction, so the target P is not the stationary distribution
  Goal: Choose between two variants
    SGFS-f inverts a d × d matrix for each new sample point
    SGFS-d inverts a diagonal matrix to reduce sampling time
  MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]
    10000 images, 51 features, binary label indicating whether an image shows a 7 or a 9
    Bayesian logistic regression posterior P
  • 20. Selecting Samplers
  [Left plot: IMQ KSD vs. number of sample points n for SGFS-d and SGFS-f. Right plots: best- and worst-aligned bivariate marginals for each sampler]
  Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)
  Right: SGFS sample points (n = 5 × 10⁴) with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with a surrogate ground truth sample (red)
  Both suggest the small speed-up of SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) is outweighed by the loss in inferential accuracy
  • 21. Beyond Sample Quality Comparison: One-sample hypothesis testing
  Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Q_n, T_P, G_k) to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016])
  The test with the default Gaussian kernel k experienced considerable loss of power as the dimension d increased
  We recreate their experiment with the IMQ kernel (β = −1/2, c = 1)
    For n = 500, generate a sample (x_i)_{i=1}^n with x_i = z_i + u_i e_1, z_i iid∼ N(0, I_d), u_i iid∼ Unif[0, 1]; target P = N(0, I_d)
    Compare with the standard normality test of Baringhaus and Henze [1988]
  Table: Mean power of multivariate normality tests across 400 simulations
             d=2   d=5   d=10  d=15  d=20  d=25
    B&H      1.0   1.0   1.0   0.91  0.57  0.26
    Gaussian 1.0   1.0   0.88  0.29  0.12  0.02
    IMQ      1.0   1.0   1.0   1.0   1.0   1.0
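  A simplified version of such a test, built on the `ksd` sketch above (assuming NumPy). It calibrates the null distribution by drawing fresh samples from P, which is easy for a Gaussian target but is *not* the wild-bootstrap calibration used by Chwialkowski, Strathmann, and Gretton [2016]; the sizes and thresholds here are my choices.

```python
import numpy as np

def ksd_normality_test(X, n_null=100, level=0.05, seed=0):
    """Crude Monte Carlo test of H0: X ~ N(0, I_d) i.i.d.  Reject if the observed
    IMQ KSD exceeds the (1 - level) quantile of KSDs of fresh samples from P.
    (The published test uses a wild bootstrap instead of re-sampling from P.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    grad_logp = lambda Z: -Z
    observed = ksd(X, grad_logp)
    null_stats = [ksd(rng.standard_normal((n, d)), grad_logp) for _ in range(n_null)]
    return observed > np.quantile(null_stats, 1 - level)

# Alternative from the slide: x_i = z_i + u_i * e_1
rng = np.random.default_rng(3)
n, d = 500, 10
X = rng.standard_normal((n, d))
X[:, 0] += rng.uniform(0, 1, size=n)
print(ksd_normality_test(X))   # typically rejects this alternative with the IMQ kernel
```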
  • 22. Beyond Sample Quality Comparison: Improving sample quality
  Given sample points (x_i)_{i=1}^n, we can minimize the KSD S(Q̃_n, T_P, G_k) over all weighted samples Q̃_n = Σ_{i=1}^n q_n(x_i) δ_{x_i}, for q_n a probability mass function
  Liu and Lee [2016] do this with the Gaussian kernel k(x, y) = e^{−‖x−y‖²₂ / h}
    Bandwidth h set to the median of the squared Euclidean distances between pairs of sample points
  We recreate their experiment with the IMQ kernel k(x, y) = (1 + ‖x − y‖²₂ / h)^{−1/2} (a simple solver is sketched below)
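  A minimal sketch of this reweighting, assuming the `imq_stein_kernel` function above: the squared KSD of a weighted sample is qᵀ K₀ q, so we approximately minimize that convex quadratic over the probability simplex with exponentiated-gradient updates. Liu and Lee solve the corresponding quadratic program exactly; the solver, step sizes, and kernel settings here are my simplification.

```python
import numpy as np

def ksd_weights(X, grad_logp, steps=2000, lr=0.05, c=1.0, beta=-0.5):
    """Approximately minimize the squared KSD q^T K0 q over probability weights q
    by exponentiated gradient (mirror descent on the simplex)."""
    K0 = imq_stein_kernel(X, grad_logp, c=c, beta=beta)
    n = len(X)
    q = np.full(n, 1.0 / n)                 # start from the uniform weighting
    for _ in range(steps):
        grad = 2.0 * K0 @ q                 # gradient of q^T K0 q
        q = q * np.exp(-lr * grad / (np.abs(grad).max() + 1e-12))
        q /= q.sum()                        # project back onto the simplex
    return q

# Example: weighted vs. uniform estimate of E_P[Z] = 0 for P = N(0, I_d)
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 10))
q = ksd_weights(X, lambda Z: -Z)
print(np.linalg.norm(q @ X) ** 2)           # squared error of the KSD-weighted mean
print(np.linalg.norm(X.mean(axis=0)) ** 2)  # squared error of the uniform mean (typically larger)
```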
  • 23. Improving Sample Quality
  [Plot: average MSE ‖E_P[Z] − E_{Q̃_n}[X]‖²₂ vs. dimension d ∈ {2, 10, 50, 75, 100} for the initial sample Q_n, Gaussian KSD weights, and IMQ KSD weights]
  MSE averaged over 500 simulations (±2 standard errors)
  Target P = N(0, I_d)
  Starting sample Q_n = (1/n) Σ_{i=1}^n δ_{x_i} for x_i iid∼ P, n = 100
  • 24. Future Directions
  Many opportunities for future development
  1. Improve the scalability of the KSD while maintaining its convergence-determining properties
     Low-rank, sparse, or stochastic approximations of the kernel matrix
     Subsampling of likelihood terms in ∇log p
  2. Address other inferential tasks
     Design of control variates [Oates, Girolami, and Chopin, 2016]
     Variational inference [Liu and Wang, 2016, Liu and Feng, 2016]
     Training generative adversarial networks [Wang and Liu, 2016] and variational autoencoders [Pu, Gan, Henao, Li, Han, and Carin, 2017]
  3. Explore the impact of Stein operator choice
     An infinite number of operators T characterize P. How is the discrepancy impacted? How do we select the best T?
     Thm: If ∇log p is bounded and k ∈ C^{(1,1)}_0, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P
     The diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2016] may be more appropriate for heavy-tailed targets
  • 25. References I
  S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, 2012.
  A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., 25A:175–184, 1988.
  A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990.
  L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339–348, 1988.
  K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, 2016.
  C. DuBois, A. Korattikara, M. Welling, and P. Smyth. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, pages 185–193, 2014.
  J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Adv. NIPS 28, pages 226–234, 2015.
  J. Gorham and L. Mackey. Measuring sample quality with kernels. arXiv:1703.01717, Mar. 2017.
  J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.
  F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.
  A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, 2014.
  Q. Liu and Y. Feng. Two methods for wild variational inference. arXiv:1612.00081, 2016.
  Q. Liu and J. Lee. Black-box importance sampling. arXiv:1610.05247, Oct. 2016. To appear in AISTATS 2017.
  Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv:1608.04471, Aug. 2016.
  Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, pages 276–284, 2016.
  L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv:1512.07392, 2015.
  A. Müller. Integral probability metrics and their generating classes of functions. Ann. Appl. Probab., 29(2):429–443, 1997.
  • 26. References II
  C. Oates and M. Girolami. Control functionals for quasi-Monte Carlo integration. In Proc. 19th AISTATS, volume 51 of PMLR, pages 56–65, 2016.
  C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B, 2016. doi: 10.1111/rssb.12185.
  Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pages 4237–4246, 2017.
  C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. II: Probability theory, pages 583–602. Univ. California Press, 1972.
  C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's Method: Expository Lectures and Applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1–26. Inst. Math. Statist., 2004.
  D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, Nov. 2016.
  M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
  • 27. Comparing Discrepancies
  [Left plot: IMQ KSD, graph Stein discrepancy, and Wasserstein distance vs. number of sample points n. Right plot: discrepancy computation time (sec) vs. n for d = 1 and d = 4]
  Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ e^{−(x+1.5)²/2} + e^{−(x−1.5)²/2} or a single mixture component
  Right: Discrepancy computation time using d cores in d dimensions
  • 28. Selecting Sampler Hyperparameters: Setup [Welling and Teh, 2011]
  Consider the posterior distribution P induced by L datapoints y_l drawn i.i.d. from the Gaussian mixture likelihood Y_l | X iid∼ (1/2) N(X_1, 2) + (1/2) N(X_1 + X_2, 2) under Gaussian priors on the parameters X ∈ R²: X_1 ∼ N(0, 10) ⊥⊥ X_2 ∼ N(0, 1)
  Draw L = 100 datapoints y_l with parameters (x_1, x_2) = (0, 1); this induces a posterior with a second mode at (x_1, x_2) = (1, −1)
  For a range of tolerance parameters ε, run approximate slice sampling for 148000 datapoint likelihood evaluations and store the resulting posterior sample Q_n
  Use the minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε
  Compare with the standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation
  Compute the median of each diagnostic over 50 random sequences
  • 29. Selecting Samplers: Setup
  MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]
    10000 images, 51 features, binary label indicating whether an image shows a 7 or a 9
    Bayesian logistic regression posterior P
    L independent observations (y_l, v_l) ∈ {1, −1} × R^d with P(Y_l = 1 | v_l, X) = 1 / (1 + exp(−⟨v_l, X⟩))
    Flat improper prior on the parameters X ∈ R^d
  Use the IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10⁵ sample points and discarding the first half as burn-in (the posterior score function needed by the KSD is sketched below)
  For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain of 10⁵ sample points [Ahn, Korattikara, and Welling, 2012]
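  To feed this posterior into the KSD sketch above, only its score function is needed. Below is a sketch under the flat-prior logistic model stated on this slide; the variable names (V for the feature matrix, y for the ±1 labels, X_sgfs_f for a sampler's output) are mine.

```python
import numpy as np

def make_logistic_grad_logp(V, y):
    """Score of the Bayesian logistic regression posterior with a flat prior:
    grad log p(x) = sum_l y_l v_l * sigmoid(-y_l <v_l, x>), for labels y_l in {-1, +1}.
    V: (L, d) feature matrix, y: (L,) label vector."""
    def grad_logp(X):                             # X: (n, d) array of sample points
        margins = (X @ V.T) * y                   # (n, L), entries y_l <v_l, x_i>
        weights = 1.0 / (1.0 + np.exp(margins))   # sigmoid(-margin)
        return (weights * y) @ V                  # (n, d) array of scores
    return grad_logp

# Score each sampler's output with the KSD sketch above, e.g.
# ksd(X_sgfs_f, make_logistic_grad_logp(V, y)) vs. ksd(X_sgfs_d, make_logistic_grad_logp(V, y))
```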