Representing and comparing probabilities: Part 2
Arthur Gretton
Gatsby Computational Neuroscience Unit,
University College London
UAI, 2017
Testing against a probabilistic model
Statistical model criticism

MMD(P, Q) = \|f^*\|_F = \sup_{\|f\|_F \le 1} [ E_Q f - E_P f ]

[Plot: densities p(x) and q(x), with the MMD witness f*(x)]

f*(x) is the witness function.

Can we compute MMD with samples from Q and a model P?
Problem: we usually can't compute E_P f in closed form.
Stein idea

To get rid of E_P f in

\sup_{\|f\|_F \le 1} [ E_Q f - E_P f ],

we define the Stein operator

[T_p f](x) = \frac{1}{p(x)} \frac{d}{dx} ( f(x) p(x) )

Then

E_P [T_P f] = 0,

subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016)
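The identity E_P[T_P f] = 0 can be checked numerically; a minimal sketch for a standard normal p, where d/dx log p(x) = -x, and the illustrative test function f(x) = sin(x):

```python
import numpy as np

# Check E_p[T_p f] = 0 for p = N(0, 1) and f(x) = sin(x).
# For this p, d/dx log p(x) = -x, so [T_p f](x) = f'(x) - x f(x).
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
stein_f = np.cos(x) - x * np.sin(x)      # [T_p f](x) with f = sin
integral = np.sum(stein_f * p) * dx      # Riemann sum for E_p[T_p f]
print(abs(integral))                     # vanishes up to quadrature error
```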
Stein idea: proof

E_p [T_p f]
  = \int \left( \frac{1}{p(x)} \frac{d}{dx} ( f(x) p(x) ) \right) p(x) \, dx
  = \int \frac{d}{dx} ( f(x) p(x) ) \, dx
  = [ f(x) p(x) ]_{-\infty}^{\infty}
  = 0
Kernel Stein Discrepancy

Stein operator

T_p f = \partial_x f + f \, \partial_x (\log p)

Kernel Stein Discrepancy (KSD)

KSD(p, q, F) = \sup_{\|g\|_F \le 1} [ E_q T_p g - E_p T_p g ]
             = \sup_{\|g\|_F \le 1} E_q T_p g,

since E_p T_p g = 0.
[Plots: densities p(x) and q(x), with the Stein witness g*(x)]
Kernel Stein Discrepancy

Closed-form expression for KSD: given Z, Z' i.i.d. ~ q, then
(Chwialkowski, Strathmann, G., ICML 2016; Liu, Lee, Jordan, ICML 2016)

KSD^2(p, q, F) = E_q h_p(Z, Z')

where

h_p(x, y) := \partial_x \log p(x) \, \partial_y \log p(y) \, k(x, y)
           + \partial_y \log p(y) \, \partial_x k(x, y)
           + \partial_x \log p(x) \, \partial_y k(x, y)
           + \partial_x \partial_y k(x, y)

and k is the RKHS kernel for F.

The statistic only depends on the kernel and \partial_x \log p(x): we do not
need to normalize p, or to sample from it.
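As a sketch of this estimator (the Gaussian kernel, its unit bandwidth, and the V-statistic form are illustrative assumptions, not prescribed by the slide), the closed form can be evaluated for a 1-D Gaussian model p = N(0, 1):

```python
import numpy as np

# V-statistic estimate of KSD^2 for p = N(0, 1) with a Gaussian kernel.
def ksd2_v(z, score, sigma=1.0):
    """Mean of h_p(z_i, z_j) over all pairs of sample points."""
    x, y = z[:, None], z[None, :]
    sx, sy = score(x), score(y)          # d/dx log p at the sample points
    d = x - y
    k = np.exp(-d**2 / (2 * sigma**2))   # Gaussian kernel
    dkx = -d / sigma**2 * k              # d/dx k(x, y)
    dky = d / sigma**2 * k               # d/dy k(x, y)
    dkxy = (1 / sigma**2 - d**2 / sigma**4) * k  # d2/dxdy k(x, y)
    h = sx * sy * k + sy * dkx + sx * dky + dkxy
    return h.mean()

score = lambda x: -x                     # d/dx log p for p = N(0, 1)
rng = np.random.default_rng(0)
same = ksd2_v(rng.normal(0.0, 1.0, 500), score)     # q = p
shifted = ksd2_v(rng.normal(1.0, 1.0, 500), score)  # q shifted away from p
print(same, shifted)   # the shifted sample gives the larger discrepancy
```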
Statistical model criticism

Example: Chicago crime data.

[Maps: model as a Gaussian mixture with two components, and the
corresponding Stein witness function]

[Maps: model as a Gaussian mixture with ten components, and the
corresponding Stein witness function]

Code: https://github.com/karlnapf/kernel_goodness_of_fit
Kernel Stein Discrepancy

Further applications: evaluation of approximate MCMC methods.
(Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017)

What kernel to use? The inverse multiquadric kernel,

k(x, y) = ( c + \|x - y\|_2^2 )^\beta   for \beta \in (-1, 0).
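A minimal sketch of this kernel (the defaults c = 1 and beta = -1/2 are illustrative choices within the stated range):

```python
import numpy as np

# Inverse multiquadric kernel k(x, y) = (c + ||x - y||^2)^beta, beta in (-1, 0),
# evaluated between two batches of vectors.
def imq_kernel(X, Y, c=1.0, beta=-0.5):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return (c + sq_dists) ** beta

K = imq_kernel(np.zeros((3, 2)), np.ones((4, 2)))
print(K.shape)   # (3, 4); each entry equals (1 + 2) ** -0.5
```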
Testing statistical dependence
Dependence testing

Given: samples from a distribution P_XY
Goal: are X and Y independent?

Example: X = dog images, Y = captions.
"Their noses guide them through life, and they're never happier than when
following an interesting scent."
"A large animal who slings slobber, exudes a distinctive houndy odor, and
wants nothing more than to follow his nose."
"A responsive, interactive pet, one that will blow in your ear and follow
you everywhere."
Text from dogtime.com and petfinder.com
MMD as a dependence measure?

Could we use MMD?

MMD( P_XY, P_X P_Y ; H_\kappa ),   with P := P_XY and Q := P_X P_Y

We don't have samples from Q := P_X P_Y, only pairs
{(x_i, y_i)}_{i=1}^n  i.i.d. ~ P_XY.

Solution: simulate Q with the pairs (x_i, y_j) for j ≠ i.

What kernel \kappa should we use for the RKHS H_\kappa?
MMD as a dependence measure

Kernel k on images with feature space F,
kernel l on captions with feature space G.
Kernel \kappa on image-text pairs: are images and captions similar?
MMD as a dependence measure

Given: samples from a distribution P_XY
Goal: are X and Y independent?

MMD^2( \hat P_XY, \hat P_X \hat P_Y ; H_\kappa ) := \frac{1}{n^2} trace(K L)

(K, L column centered)
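The trace statistic can be sketched in a few lines; Gaussian kernels with unit bandwidth are an illustrative choice, and centering is done with the matrix H = I - (1/n) 1 1^T:

```python
import numpy as np

# Biased statistic MMD^2 = (1/n^2) trace(Kc Lc) with doubly centered
# Gram matrices Kc = HKH and Lc = HLH.
def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def hsic_biased(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(H @ gram(x) @ H @ gram(y) @ H) / n**2

rng = np.random.default_rng(1)
x = rng.normal(size=300)
dep = hsic_biased(x, x + 0.1 * rng.normal(size=300))  # strongly dependent pair
ind = hsic_biased(x, rng.normal(size=300))            # independent pair
print(dep, ind)   # the dependent pair gives the larger statistic
```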
MMD as a dependence measure

Two questions:

Why the product kernel? There are many ways to combine kernels: why not,
e.g., a sum?

Is there a more interpretable way of defining this dependence measure?
Finding covariance with smooth transformations

Illustration: two variables with no correlation but strong dependence.

[Plots: raw data (Correlation: 0.00); smooth transformations f(x) and g(y);
transformed data (Correlation: 0.90)]
Define two spaces, one for each witness

Function in F:
f(x) = \sum_{j=1}^\infty f_j \varphi_j(x)
Feature map \varphi(x); kernel for RKHS F on the domain X:
k(x, x') = \langle \varphi(x), \varphi(x') \rangle_F

Function in G:
g(y) = \sum_{j=1}^\infty g_j \psi_j(y)
Feature map \psi(y); kernel for RKHS G on the domain Y:
l(y, y') = \langle \psi(y), \psi(y') \rangle_G
The constrained covariance

The constrained covariance is

COCO(P_XY) = \sup_{\|f\|_F \le 1, \, \|g\|_G \le 1} cov[ f(x), g(y) ]

[Plots: uncorrelated but dependent data (Correlation: 0.00); witness
functions f(x) and g(y); transformed data (Correlation: 0.90)]
The constrained covariance

Expanding f and g in the feature coordinates,

COCO(P_XY) = \sup_{\|f\|_F \le 1, \, \|g\|_G \le 1}
    E_{xy} \left[ \left( \sum_{j=1}^\infty f_j \varphi_j(x) \right)
                  \left( \sum_{j=1}^\infty g_j \psi_j(y) \right) \right]

Fine print: the feature mappings \varphi(x) and \psi(y) are assumed to have
zero mean, so this expectation equals the covariance.

Rewriting:

E_{xy} [ f(x) g(y) ]
  = \begin{bmatrix} f_1 & f_2 & \cdots \end{bmatrix}
    \underbrace{ E_{xy} \left( \begin{bmatrix} \varphi_1(x) \\ \varphi_2(x) \\ \vdots \end{bmatrix}
    \begin{bmatrix} \psi_1(y) & \psi_2(y) & \cdots \end{bmatrix} \right) }_{C_{\varphi(x)\psi(y)}}
    \begin{bmatrix} g_1 \\ g_2 \\ \vdots \end{bmatrix}

COCO is the maximum singular value of the feature covariance C_{\varphi(x)\psi(y)}.
Computing COCO in practice

Given a sample {(x_i, y_i)}_{i=1}^n i.i.d. ~ P_XY, what is the empirical COCO?

COCO is the largest eigenvalue \gamma_max of the generalized eigenvalue problem

\begin{bmatrix} 0 & \frac{1}{n} K L \\ \frac{1}{n} L K & 0 \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
= \gamma
\begin{bmatrix} K & 0 \\ 0 & L \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix},

where K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j).

Witness functions (singular vectors):

f(x) \propto \sum_{i=1}^n \alpha_i k(x_i, x),    g(y) \propto \sum_{i=1}^n \beta_i l(y_i, y)

Fine print: kernels are computed with empirically centered features
\varphi(x) - \frac{1}{n} \sum_{i=1}^n \varphi(x_i) and \psi(y) - \frac{1}{n} \sum_{i=1}^n \psi(y_i).

A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath,
Y. Murayama, J. Pauls, B. Schoelkopf, and N. Logothetis, AISTATS 2005
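Equivalently (a sketch assuming Gaussian kernels with unit bandwidth), the generalized eigenvalue problem reduces to COCO = (1/n) sqrt(lambda_max(Kc Lc)) for the doubly centered Gram matrices Kc, Lc:

```python
import numpy as np

# Empirical COCO via the centered Gram matrices.
def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def coco(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ gram(x) @ H, H @ gram(y) @ H
    # Kc @ Lc has real nonnegative eigenvalues (it is similar to the
    # symmetric PSD matrix Kc^{1/2} Lc Kc^{1/2}).
    lam = np.max(np.real(np.linalg.eigvals(Kc @ Lc)))
    return np.sqrt(max(lam, 0.0)) / n

rng = np.random.default_rng(2)
t = rng.uniform(-np.pi, np.pi, 500)
dep = coco(np.cos(t), np.sin(t))                        # dependent, uncorrelated
ind = coco(rng.normal(size=500), rng.normal(size=500))  # independent
print(dep, ind)   # the dependent pair gives the larger COCO
```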
What is a large dependence with COCO?

[Plots: a smooth density and 500 samples from it; a rough density and 500
samples from it]

The density takes the form

P_XY \propto 1 + \sin(\omega x) \sin(\omega y)

Which of these is the more "dependent"?
Finding covariance with smooth transformations

Case of \omega = 1:

[Plots: data (Correlation: 0.31); witness functions f, g; transformed data
(Correlation: 0.50, COCO: 0.09)]
Finding covariance with smooth transformations

Case of \omega = 2:

[Plots: data (Correlation: 0.02); witness functions f, g; transformed data
(Correlation: 0.54, COCO: 0.07)]
Finding covariance with smooth transformations

Case of \omega = 3:

[Plots: data (Correlation: 0.03); witness functions f, g; transformed data
(Correlation: 0.44, COCO: 0.04)]
Finding covariance with smooth transformations

Case of \omega = 4:

[Plots: data (Correlation: 0.05); witness functions f, g; transformed data
(Correlation: 0.25, COCO: 0.02)]
Finding covariance with smooth transformations

Case of \omega = 0: uniform noise! (shows bias)

[Plots: data (Correlation: 0.01); witness functions f, g; transformed data
(Correlation: 0.14, COCO: 0.02)]
Dependence is largest at "low" frequencies

As the dependence is encoded at higher frequencies, the smooth mappings f, g
achieve lower linear dependence.

Even for independent variables, COCO will not be zero at finite sample
sizes, since f and g will find some mild linear dependence (bias).

This bias decreases with increasing sample size.
Can we do better than COCO?

A second example with zero correlation.

First singular value of the feature covariance C_{\varphi(x)\psi(y)}:
[Plots: data (Correlation: 0.00); first pair of witness functions;
transformed data (Correlation: 0.80, COCO_1: 0.11)]

Second singular value of the feature covariance C_{\varphi(x)\psi(y)}:
[Plots: second pair of witness functions; transformed data
(Correlation: 0.37, COCO_2: 0.06)]
The Hilbert-Schmidt Independence Criterion

Writing the ith singular value of the feature covariance C_{\varphi(x)\psi(y)} as

\gamma_i := COCO_i(P_XY; F, G),

define the Hilbert-Schmidt Independence Criterion (HSIC)

HSIC^2(P_XY; F, G) = \sum_{i=1}^\infty \gamma_i^2.

HSIC is MMD with a product kernel!

HSIC^2(P_XY; F, G) = MMD^2(P_XY, P_X P_Y; H_\kappa)

where \kappa( (x, y), (x', y') ) = k(x, x') l(y, y').

A. Gretton, O. Bousquet, A. Smola, and B. Schoelkopf, ALT 2005; A. Gretton,
K. Fukumizu, C.H. Teo, L. Song, B. Schoelkopf, and A. Smola, NIPS 2007
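As a quick numerical sketch (Gaussian kernels with unit bandwidth assumed) that the sum of squared empirical singular values matches the trace statistic:

```python
import numpy as np

# The empirical COCO_i^2 are the eigenvalues of Kc Lc divided by n^2, so
# their sum equals (1/n^2) trace(Kc Lc), the empirical HSIC^2.
def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = x + rng.normal(size=n)
H = np.eye(n) - np.ones((n, n)) / n
Kc, Lc = H @ gram(x) @ H, H @ gram(y) @ H

gammas2 = np.real(np.linalg.eigvals(Kc @ Lc)) / n**2  # COCO_i^2
hsic2 = np.trace(Kc @ Lc) / n**2
print(abs(gammas2.sum() - hsic2))  # agree to machine precision
```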
Asymptotics of HSIC under independence

Given a sample {(x_i, y_i)}_{i=1}^n i.i.d. ~ P_XY, what is the empirical HSIC?

Empirical HSIC (biased):

HSIC = \frac{1}{n^2} trace(K L)

K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j) (K and L computed with
empirically centered features).

Statistical testing: given P_XY = P_X P_Y, what is the threshold c_\alpha
such that P(HSIC > c_\alpha) < \alpha for small \alpha?

Asymptotics of HSIC when P_XY = P_X P_Y:

n \, HSIC \xrightarrow{D} \sum_{l=1}^\infty \lambda_l z_l^2,   z_l ~ N(0, 1) i.i.d.,

where \lambda_l \psi_l(z_j) = \int h_{ijqr} \psi_l(z_i) \, dF_{i,q,r},

h_{ijqr} = \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} ( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} )
A statistical test

Given P_XY = P_X P_Y, what is the threshold c_\alpha such that
P(HSIC > c_\alpha) < \alpha for small \alpha (prob. of false positive)?

Original time series:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10

Permutation:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10

Null distribution via permutation:
Compute HSIC for {x_i, y_{\sigma(i)}}_{i=1}^n for a random permutation \sigma
of the indices {1, ..., n}. This gives HSIC for independent variables.
Repeat for many different permutations to get an empirical CDF.
The threshold c_\alpha is the 1 - \alpha quantile of the empirical CDF.
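The permutation procedure above can be sketched as follows, built on the biased HSIC statistic; Gaussian kernels with unit bandwidth and 200 permutations are illustrative choices:

```python
import numpy as np

def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def hsic(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ gram(x) @ H @ gram(y) @ H) / n**2

def hsic_test(x, y, alpha=0.05, n_perms=200, seed=0):
    rng = np.random.default_rng(seed)
    stat = hsic(x, y)
    # HSIC under random pairings simulates the null P_XY = P_X P_Y.
    null = np.array([hsic(x, y[rng.permutation(len(y))])
                     for _ in range(n_perms)])
    c_alpha = np.quantile(null, 1 - alpha)  # 1 - alpha quantile of null CDF
    return stat, c_alpha, stat > c_alpha

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = np.sin(3 * x) + 0.3 * rng.normal(size=200)   # strongly dependent
stat, c_alpha, reject = hsic_test(x, y)
print(reject)   # dependence detected
```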
Application: dependence detection across languages

Testing task: detect dependence between English and French text.

X (English):
"Honourable senators, I have a question for the Leader of the Government in
the Senate"
"No doubt there is great pressure on provincial and municipal governments"
"In fact, we have increased federal investments for early childhood
development."

Y (French):
"Honorables sénateurs, ma question s'adresse au leader du gouvernement au
Sénat"
"Les ordres de gouvernements provinciaux et municipaux subissent de fortes
pressions"
"Au contraire, nous avons augmenté le financement fédéral pour le
développement des jeunes"

Text from the aligned Hansards of the 36th parliament of Canada,
https://www.isi.edu/natural-language/download/hansard/
Application: dependence detection across languages

Testing task: detect dependence between English and French text.

k-spectrum kernel, k = 10, sample size n = 10

HSIC = \frac{1}{n^2} trace(K L)

(K and L column centered)
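A minimal sketch of the k-spectrum kernel (the dot product of length-k substring counts; k = 3 here for readability, while the slide uses k = 10):

```python
from collections import Counter

# k-spectrum kernel: count all length-k substrings of each string, then
# take the dot product of the two count vectors.
def spectrum(s, k=3):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    cs, ct = spectrum(s, k), spectrum(t, k)
    return sum(cs[sub] * ct[sub] for sub in cs)

print(spectrum_kernel("senators", "senate"))  # 3: shared "sen", "ena", "nat"
```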
Application: dependence detection across languages

Results (for \alpha = 0.05):
k-spectrum kernel: average Type II error 0
Bag-of-words kernel: average Type II error 0.18

Settings: five-line extracts, averaged over 300 repetitions, for the
"Agriculture" transcripts. Similar results for the Fisheries and
Immigration transcripts.
Testing higher order interactions
Detecting higher order interaction

How to detect V-structures with pairwise weak individual dependence?

[Graph: V-structure X -> Z <- Y]

Pairwise, X ⊥⊥ Y, Y ⊥⊥ Z, and X ⊥⊥ Z, where

X, Y i.i.d. ~ N(0, 1)
Z | X, Y ~ sign(XY) Exp(1/\sqrt{2})

[Scatter plots: X1 vs Y1, Y1 vs Z1, and X1 vs Z1 show no dependence;
X1*Y1 vs Z1 reveals the interaction]

Fine print: faithfulness is violated here!
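Simulating this construction (reading Exp(1/sqrt(2)) as rate 1/sqrt(2), i.e. scale sqrt(2), an assumption about the slide's convention) shows the pairwise correlations vanish while X*Y predicts Z:

```python
import numpy as np

# X, Y ~ N(0, 1); Z has exponential magnitude and the sign of X*Y.
rng = np.random.default_rng(5)
n = 20000
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = np.sign(X * Y) * rng.exponential(scale=np.sqrt(2), size=n)

print(np.corrcoef(X, Z)[0, 1])      # near 0: X and Z look independent
print(np.corrcoef(Y, Z)[0, 1])      # near 0
print(np.corrcoef(X * Y, Z)[0, 1])  # clearly positive: three-way interaction
```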
V-structure discovery

[Graph: V-structure X -> Z <- Y]

Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:

Consistent CI test: H0: X ⊥⊥ Y | Z [Fukumizu et al. 2008, Zhang et al. 2011]

Factorisation test: H0: (X, Y) ⊥⊥ Z or (X, Z) ⊥⊥ Y or (Y, Z) ⊥⊥ X
(multiple standard two-variable tests)

How well do these work?
Detecting higher order interaction

Generalise the earlier example to p dimensions:

X_1, Y_1 i.i.d. ~ N(0, 1),   Z_1 | X_1, Y_1 ~ sign(X_1 Y_1) Exp(1/\sqrt{2})
X_{2:p}, Y_{2:p}, Z_{2:p} i.i.d. ~ N(0, I_{p-1})

Pairwise, X ⊥⊥ Y, Y ⊥⊥ Z, and X ⊥⊥ Z.

Fine print: faithfulness is violated here!
V-structure discovery

[Results: CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), versus a
factorisation test, n = 500]
Lancaster interaction measure

The Lancaster interaction measure of (X_1, ..., X_D) ~ P is a signed measure
\Delta_L P that vanishes whenever P can be factorised non-trivially.

D = 2:  \Delta_L P = P_XY - P_X P_Y
D = 3:  \Delta_L P = P_XYZ - P_X P_YZ - P_Y P_XZ - P_Z P_XY + 2 P_X P_Y P_Z

Case of X ⊥⊥ (Y, Z): the terms cancel, and \Delta_L P = 0.
Lancaster interaction measure

(X, Y) ⊥⊥ Z or (X, Z) ⊥⊥ Y or (Y, Z) ⊥⊥ X  ⇒  \Delta_L P = 0.

...so what might be missed? The converse fails:

\Delta_L P = 0 does not imply (X, Y) ⊥⊥ Z or (X, Z) ⊥⊥ Y or (Y, Z) ⊥⊥ X.

Example:
P(0,0,0) = 0.2   P(0,0,1) = 0.1   P(1,0,0) = 0.1   P(1,0,1) = 0.1
P(0,1,0) = 0.1   P(0,1,1) = 0.1   P(1,1,0) = 0.1   P(1,1,1) = 0.2
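This example can be checked directly; a sketch:

```python
import numpy as np

# The Lancaster measure of this table vanishes, yet no two-vs-one
# factorisation holds.
P = np.zeros((2, 2, 2))
P[0, 0, 0], P[0, 0, 1], P[1, 0, 0], P[1, 0, 1] = 0.2, 0.1, 0.1, 0.1
P[0, 1, 0], P[0, 1, 1], P[1, 1, 0], P[1, 1, 1] = 0.1, 0.1, 0.1, 0.2

Px, Py, Pz = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
Pxy, Pxz, Pyz = P.sum(2), P.sum(1), P.sum(0)

lancaster = (P
             - Px[:, None, None] * Pyz[None, :, :]
             - Py[None, :, None] * Pxz[:, None, :]
             - Pz[None, None, :] * Pxy[:, :, None]
             + 2 * Px[:, None, None] * Py[None, :, None] * Pz[None, None, :])

print(np.abs(lancaster).max())                        # 0: Lancaster vanishes
print(np.abs(P - Pxy[:, :, None] * Pz[None, None, :]).max())  # (X,Y) not indep of Z
```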
A kernel test statistic using the Lancaster measure

Construct a test by estimating \| \mu_\kappa (\Delta_L P) \|^2_{H_\kappa},
where \kappa = k \otimes l \otimes m:

\| \mu_\kappa ( P_XYZ - P_XY P_Z - \cdots ) \|^2_{H_\kappa}
  = \langle P_XYZ, P_XYZ \rangle_{H_\kappa}
  - 2 \langle P_XYZ, P_XY P_Z \rangle_{H_\kappa} + \cdots
A kernel test statistic using the Lancaster measure

[Table: V-statistic estimators of the inner products \langle \cdot, \cdot \rangle_{H_\kappa}
(without the P_X P_Y P_Z terms)]

With H = I - n^{-1} 1 1^T the centering matrix, the terms combine into the
Lancaster interaction statistic (D. Sejdinovic, AG, W. Bergsma, NIPS 2013):

\| \mu_\kappa (\Delta_L \hat P) \|^2_{H_\kappa} = \frac{1}{n^2} ( HKH \circ HLH \circ HMH )_{++},

the empirical joint central moment in the feature space. Here \circ is the
entrywise product, and ( \cdot )_{++} sums all entries of the matrix.
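A sketch of this statistic for three scalar variables, assuming Gaussian kernels with unit bandwidth, evaluated on the earlier V-structure example against a fully independent triple:

```python
import numpy as np

# Empirical Lancaster statistic (1/n^2) (HKH o HLH o HMH)_{++}.
def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def lancaster_stat(x, y, z):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L, M = (H @ gram(v) @ H for v in (x, y, z))
    return (K * L * M).sum() / n**2   # entrywise product, summed entries

rng = np.random.default_rng(6)
n = 500
X, Y = rng.normal(size=n), rng.normal(size=n)
Z = np.sign(X * Y) * rng.exponential(np.sqrt(2), size=n)  # V-structure
three_way = lancaster_stat(X, Y, Z)
null = lancaster_stat(X, Y, rng.normal(size=n))           # independent triple
print(three_way, null)   # the V-structure gives the larger value
```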
V-structure discovery

[Results: Lancaster test, CI test for X ⊥⊥ Y | Z from Zhang et al. (2011),
and a factorisation test, n = 500]
Interaction for D ≥ 4

Interaction measure valid for all D (Streitberg, 1990):

\Delta_S P = \sum_{\pi} (-1)^{|\pi| - 1} (|\pi| - 1)! \, J_\pi P

For a partition \pi, J_\pi associates to the joint the corresponding
factorisation, e.g., J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}.

The sum runs over all partitions \pi of {1, ..., D}, and the number of
partitions (the Bell numbers) grows very rapidly with D.

[Plot: number of partitions of {1, ..., D} for D = 1 to 25, on a log scale
reaching past 1e+19: Bell numbers growth]
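The Bell numbers counting those partitions can be generated with the Bell triangle recurrence, which makes the growth of the number of terms in Streitberg's measure concrete:

```python
# Bell numbers count the partitions of {1, ..., D}.
def bell_numbers(n):
    """Return [B_0, ..., B_n] via the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]              # each row starts with the last entry
        for v in row:                    # of the previous row
            new_row.append(new_row[-1] + v)
        bells.append(new_row[0])
        row = new_row
    return bells

print(bell_numbers(10))  # [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```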
Part 4: Advanced topics
Robust Image Denoising in RKHS via Orthogonal Matching PursuitRobust Image Denoising in RKHS via Orthogonal Matching Pursuit
Robust Image Denoising in RKHS via Orthogonal Matching Pursuit
Pantelis Bouboulis
 
Sep logic slide
Sep logic slideSep logic slide
Sep logic slide
rainoftime
 
Dcs unit 2
Dcs unit 2Dcs unit 2
Dcs unit 2
Anil Nigam
 

Similar to Representing and comparing probabilities: Part 2 (20)

Convolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernelsConvolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernels
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
 
Nec 602 unit ii Random Variables and Random process
Nec 602 unit ii Random Variables and Random processNec 602 unit ii Random Variables and Random process
Nec 602 unit ii Random Variables and Random process
 
Nonparametric testing for exogeneity with discrete regressors and instruments
Nonparametric testing for exogeneity with discrete regressors and instrumentsNonparametric testing for exogeneity with discrete regressors and instruments
Nonparametric testing for exogeneity with discrete regressors and instruments
 
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE AND LIE ALGEBRA
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE  AND LIE ALGEBRASYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE  AND LIE ALGEBRA
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE AND LIE ALGEBRA
 
[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
 
A Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description LogicsA Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description Logics
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference Compilation
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
 
Lesson 26
Lesson 26Lesson 26
Lesson 26
 
AI Lesson 26
AI Lesson 26AI Lesson 26
AI Lesson 26
 
A new integer programming model for hp problem
A new integer programming model for hp problemA new integer programming model for hp problem
A new integer programming model for hp problem
 
MarcoCeze_defense
MarcoCeze_defenseMarcoCeze_defense
MarcoCeze_defense
 
Optimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryOptimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-periphery
 
PAGOdA poster
PAGOdA posterPAGOdA poster
PAGOdA poster
 
Robust Image Denoising in RKHS via Orthogonal Matching Pursuit
Robust Image Denoising in RKHS via Orthogonal Matching PursuitRobust Image Denoising in RKHS via Orthogonal Matching Pursuit
Robust Image Denoising in RKHS via Orthogonal Matching Pursuit
 
Sep logic slide
Sep logic slideSep logic slide
Sep logic slide
 
Dcs unit 2
Dcs unit 2Dcs unit 2
Dcs unit 2
 

More from MLReview

Bayesian Non-parametric Models for Data Science using PyMC
 Bayesian Non-parametric Models for Data Science using PyMC Bayesian Non-parametric Models for Data Science using PyMC
Bayesian Non-parametric Models for Data Science using PyMC
MLReview
 
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
MLReview
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
MLReview
 
PixelGAN Autoencoders
  PixelGAN Autoencoders  PixelGAN Autoencoders
PixelGAN Autoencoders
MLReview
 
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
MLReview
 
Theoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryTheoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning Theory
MLReview
 
2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems
MLReview
 
Deep Learning for Semantic Composition
Deep Learning for Semantic CompositionDeep Learning for Semantic Composition
Deep Learning for Semantic Composition
MLReview
 
Near human performance in question answering?
Near human performance in question answering?Near human performance in question answering?
Near human performance in question answering?
MLReview
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial Networks
MLReview
 
Real-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridReal-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral Grid
MLReview
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
MLReview
 

More from MLReview (12)

Bayesian Non-parametric Models for Data Science using PyMC
 Bayesian Non-parametric Models for Data Science using PyMC Bayesian Non-parametric Models for Data Science using PyMC
Bayesian Non-parametric Models for Data Science using PyMC
 
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
 
PixelGAN Autoencoders
  PixelGAN Autoencoders  PixelGAN Autoencoders
PixelGAN Autoencoders
 
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 
Theoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryTheoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning Theory
 
2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems
 
Deep Learning for Semantic Composition
Deep Learning for Semantic CompositionDeep Learning for Semantic Composition
Deep Learning for Semantic Composition
 
Near human performance in question answering?
Near human performance in question answering?Near human performance in question answering?
Near human performance in question answering?
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial Networks
 
Real-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridReal-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral Grid
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
 

Recently uploaded

如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
frank0071
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 

Recently uploaded (20)

如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 

Representing and comparing probabilities: Part 2

  • 1. Representing and comparing probabilities: Part 2 Arthur Gretton Gatsby Computational Neuroscience Unit, University College London UAI, 2017 1/52
  • 2. Testing against a probabilistic model 2/52
  • 3. Statistical model criticism MMD(P, Q) = ‖f*‖ = sup_{‖f‖_F ≤ 1} [E_Q f − E_P f] [plot: p(x), q(x), and the witness f*(x)] f*(x) is the witness function. Can we compute MMD with samples from Q and a model P? Problem: we usually can't compute E_P f in closed form. 3/52
  • 4. Stein idea To get rid of E_p f in sup_{‖f‖_F ≤ 1} [E_q f − E_p f], we define the Stein operator [T_p f](x) = (1/p(x)) (d/dx)(f(x) p(x)). Then E_P T_P f = 0, subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016) 4/52
  • 5. Stein idea: proof E_p[T_p f] = ∫ [(1/p(x)) (d/dx)(f(x) p(x))] p(x) dx = ∫ (d/dx)(f(x) p(x)) dx = [f(x) p(x)]_{−∞}^{∞} = 0 5/52
  • 6.–9. Stein idea: proof (animation builds of the previous slide: the 1/p(x) factor cancels p(x)) E_p[T_p f] = ∫ (d/dx)(f(x) p(x)) dx = [f(x) p(x)]_{−∞}^{∞} = 0 5/52
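The identity E_p[T_p f] = 0 can be checked by Monte Carlo. A minimal sketch under assumptions not in the slides: p is the standard normal (so the boundary terms vanish, and the score is d/dx log p(x) = −x) and the test function is f(x) = sin(x).

```python
import numpy as np

# Monte Carlo check of the Stein identity E_p[T_p f] = 0, where
# (T_p f)(x) = (1/p(x)) d/dx (f(x) p(x)) = f'(x) + f(x) d/dx log p(x).
# Assumptions (illustrative, not from the slides): p = N(0, 1), f(x) = sin(x).
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

stein_values = np.cos(x) + np.sin(x) * (-x)  # (T_p f)(x) at each sample
mc_mean = stein_values.mean()                # close to 0, up to MC error
```

With 10^5 samples the average is zero to within Monte Carlo error; the identity fails when the score does not match the sampling distribution, which is exactly what the kernel Stein discrepancy exploits.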
  • 10. Kernel Stein Discrepancy Stein operator T_p f = ∂_x f + f ∂_x(log p) Kernel Stein Discrepancy (KSD) KSD(p, q; F) = sup_{‖g‖_F ≤ 1} E_q T_p g − E_p T_p g 6/52
  • 11.–13. Kernel Stein Discrepancy Since E_p T_p g = 0, the model term cancels: KSD(p, q; F) = sup_{‖g‖_F ≤ 1} E_q T_p g [plots: p(x), q(x), and the Stein witness g*(x)] 6/52
  • 14. Kernel Stein Discrepancy Closed-form expression for KSD: given Z, Z′ ∼ q i.i.d., (Chwialkowski, Strathmann, G., ICML 2016; Liu, Lee, Jordan, ICML 2016) KSD²(p, q; F) = E_q h_p(Z, Z′), where h_p(x, y) := ∂_x log p(x) ∂_y log p(y) k(x, y) + ∂_y log p(y) ∂_x k(x, y) + ∂_x log p(x) ∂_y k(x, y) + ∂_x ∂_y k(x, y) and k is the kernel of the RKHS F. This depends only on the kernel and on ∂_x log p(x): we need neither to normalize p nor to sample from it. 7/52
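The closed-form expression above can be sketched numerically with the V-statistic (1/n²) Σ_{i,j} h_p(z_i, z_j). Everything concrete here is an illustrative assumption rather than the tutorial's code: a one-dimensional standard normal model p (so ∂_x log p(x) = −x) and a Gaussian RBF kernel with unit bandwidth.

```python
import numpy as np

def ksd2_vstat(z, score, sigma=1.0):
    """V-statistic estimate of KSD^2 for 1-d samples z under model score.

    Uses h_p(x, y) = s(x) s(y) k(x, y) + s(y) dx_k + s(x) dy_k + dxdy_k
    for the RBF kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2)).
    """
    d = z[:, None] - z[None, :]                 # pairwise differences x - y
    k = np.exp(-d**2 / (2 * sigma**2))          # kernel matrix
    dx_k = -d / sigma**2 * k                    # d/dx k(x, y)
    dy_k = d / sigma**2 * k                     # d/dy k(x, y)
    dxdy_k = (1.0 / sigma**2 - d**2 / sigma**4) * k
    s = score(z)                                # model score at each sample
    h = (np.outer(s, s) * k + s[None, :] * dx_k
         + s[:, None] * dy_k + dxdy_k)
    return h.mean()

rng = np.random.default_rng(1)
score = lambda x: -x                            # d/dx log p for p = N(0, 1)
matched = ksd2_vstat(rng.standard_normal(500), score)        # q = p
shifted = ksd2_vstat(rng.standard_normal(500) + 2.0, score)  # q = N(2, 1)
```

When the samples match the model, the estimate is near zero (a small positive bias remains because the V-statistic keeps the diagonal terms); shifted samples give a value far from zero.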
  • 16. Statistical model criticism Chicago crime data. Model is a Gaussian mixture with two components. 8/52
  • 17. Statistical model criticism Chicago crime data. Model is a Gaussian mixture with two components, with the Stein witness function overlaid. 8/52
  • 18. Statistical model criticism Chicago crime data. Model is a Gaussian mixture with ten components. 8/52
  • 19. Statistical model criticism Chicago crime data. Model is a Gaussian mixture with ten components, with the Stein witness function overlaid. Code: https://github.com/karlnapf/kernel_goodness_of_fit 8/52
  • 20.–21. Kernel Stein Discrepancy Further applications: evaluation of approximate MCMC methods. (Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017) What kernel to use? The inverse multiquadric kernel, k(x, y) = (c + ‖x − y‖²₂)^β for β ∈ (−1, 0).
  • 24. Dependence testing Given: samples from a distribution P_XY. Goal: are X and Y independent? "Their noses guide them through life, and they're never happier than when following an interesting scent." "A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose." "A responsive, interactive pet, one that will blow in your ear and follow you everywhere." Text from dogtime.com and petfinder.com. 11/52
  • 25.–27. MMD as a dependence measure? Could we use MMD? MMD(P_XY, P_X × P_Y; H), with P := P_XY and Q := P_X × P_Y. We don't have samples from Q := P_X × P_Y, only pairs {(x_i, y_i)}_{i=1}^n ∼ i.i.d. P_XY. Solution: simulate Q with pairs (x_i, y_j) for j ≠ i. What kernel to use for the RKHS H? 12/52
  • 28.–29. MMD as a dependence measure Kernel k on images with feature space F; kernel l on captions with feature space G. Kernel on image–text pairs: are images and captions similar? 13/52
  • 30.–31. MMD as a dependence measure Given: samples from a distribution P_XY. Goal: are X and Y independent? MMD²(P̂_XY, P̂_X × P̂_Y; H) := (1/n²) trace(KL) (K, L column centered) 14/52
  • 32. MMD as a dependence measure Two questions: Why the product kernel? There are many ways to combine kernels: why not, e.g., a sum? Is there a more interpretable way of defining this dependence measure? 15/52
  • 33.–35. Finding covariance with smooth transformations Illustration: two variables with no correlation but strong dependence. [plots: raw samples, correlation 0.00; after smooth transformations of each variable, correlation 0.90] 16/52
  • 36. Define two spaces, one for each witness. Function in F: f(x) = Σ_{j=1}^∞ f_j φ_j(x), with feature map φ and kernel k(x, x′) = ⟨φ(x), φ(x′)⟩_F for the RKHS F on X. Function in G: g(y) = Σ_{j=1}^∞ g_j ψ_j(y), with feature map ψ and kernel l(y, y′) = ⟨ψ(y), ψ(y′)⟩_G for the RKHS G on Y. 17/52
  • 37. The constrained covariance The constrained covariance is COCO(P_XY) = sup_{‖f‖_F ≤ 1, ‖g‖_G ≤ 1} cov[f(x), g(y)] [plots: uncorrelated samples (correlation 0.00); after the witness transformations, correlation 0.90] 18/52
  • 38.–41. The constrained covariance COCO(P_XY) = sup_{‖f‖_F ≤ 1, ‖g‖_G ≤ 1} E_xy[(Σ_{j=1}^∞ f_j φ_j(x)) (Σ_{j=1}^∞ g_j ψ_j(y))] Fine print: the feature maps φ(x) and ψ(y) are assumed to have zero mean. Rewriting: E_xy[f(x) g(y)] = f⊤ C_{φ(x)ψ(y)} g, where f = [f_1 f_2 …]⊤, g = [g_1 g_2 …]⊤, and C_{φ(x)ψ(y)} := E_xy[φ(x) ψ(y)⊤] is the feature covariance. COCO is the maximum singular value of the feature covariance C_{φ(x)ψ(y)}. 18/52
  • 42.–45. Computing COCO in practice Given a sample {(x_i, y_i)}_{i=1}^n ∼ i.i.d. P_XY, what is the empirical COCO? It is the largest eigenvalue γ_max of the generalized eigenvalue problem
      [ 0         (1/n) KL ] [α]       [ K  0 ] [α]
      [ (1/n) LK  0        ] [β]  = γ  [ 0  L ] [β]
  with K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j). Fine print: kernels are computed with empirically centered features φ(x) − (1/n) Σ_{i=1}^n φ(x_i) and ψ(y) − (1/n) Σ_{i=1}^n ψ(y_i). (A.G., A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schoelkopf, and N. Logothetis, AISTATS 2005) 19/52
  • 46.–49. Computing COCO in practice The witness functions are built from the singular vectors: f(x) ∝ Σ_{i=1}^n α_i k(x_i, x) and g(y) ∝ Σ_{i=1}^n β_i l(y_i, y). 19/52
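The eigenvalue computation above can be sketched as follows; this is an illustrative implementation rather than the authors' code. It assumes RBF kernels with unit bandwidth and uses the equivalent formulation of COCO as the largest singular value of the empirical feature covariance, i.e. the square root of the top eigenvalue of (1/n²) K̃L̃, with empirically centered kernel matrices K̃, L̃.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    # Gaussian RBF kernel matrix between 1-d samples a and b
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def coco(x, y, sigma=1.0):
    # Empirical COCO: sqrt of the top eigenvalue of (1/n^2) K~ L~,
    # where K~, L~ are kernel matrices on empirically centered features.
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    Kc = H @ rbf(x, x, sigma) @ H
    Lc = H @ rbf(y, y, sigma) @ H
    eigs = np.linalg.eigvals(Kc @ Lc / n**2)    # real, nonnegative in theory
    return float(np.sqrt(np.max(eigs.real)))

rng = np.random.default_rng(2)
x = rng.standard_normal(300)
dep = coco(x, np.sin(2 * x) + 0.1 * rng.standard_normal(300))  # dependent
ind = coco(x, rng.standard_normal(300))                        # independent
```

Here dep comes out clearly larger than ind, but ind is not exactly zero: this is the finite-sample bias discussed on the later slides.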
  • 50. What is a large dependence with COCO? [plots: a smooth density and 500 samples from it; a rough density and 500 samples from it] The density takes the form P_XY ∝ 1 + sin(ωx) sin(ωy). Which of these is the more "dependent"? 20/52
  • 51. Finding covariance with smooth transformations Case of ω = 1: raw correlation 0.31; correlation after smooth transformations 0.50; COCO: 0.09 21/52
  • 52. Finding covariance with smooth transformations Case of ω = 2: raw correlation 0.02; correlation after smooth transformations 0.54; COCO: 0.07 22/52
  • 53. Finding covariance with smooth transformations Case of ω = 3: raw correlation 0.03; correlation after smooth transformations 0.44; COCO: 0.04 23/52
  • 54. Finding covariance with smooth transformations Case of ω = 4: raw correlation 0.05; correlation after smooth transformations 0.25; COCO: 0.02 24/52
  • 55. Finding covariance with smooth transformations Case of ω = ∞: raw correlation 0.01; correlation after smooth transformations 0.14; COCO: 0.02 25/52
  • 56. Finding covariance with smooth transformations Case of $\omega = 0$: uniform noise! (shows bias) Raw correlation 0.01; correlation after the smooth mappings: 0.14; COCO: 0.02. 26/52
  • 57. Dependence is largest at “low” frequencies As the dependence is encoded at higher frequencies, the smooth mappings $f, g$ achieve lower linear dependence. Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be found by $f, g$ (bias). This bias decreases with increasing sample size. 27/52
  • 58. Can we do better than COCO? A second example with zero correlation. First singular value of the feature covariance $C_{\varphi(x)\psi(y)}$: raw correlation 0.00; correlation after the first pair of witness functions: 0.80; $\mathrm{COCO}_1$: 0.11. 28/52
  • 60. Can we do better than COCO? A second example with zero correlation. Second singular value of the feature covariance $C_{\varphi(x)\psi(y)}$: raw correlation 0.00; correlation after the second pair of witness functions: 0.37; $\mathrm{COCO}_2$: 0.06. 28/52
  • 62. The Hilbert-Schmidt Independence Criterion Writing the $i$th singular value of the feature covariance $C_{\varphi(x)\psi(y)}$ as $\gamma_i := \mathrm{COCO}_i(P_{XY}; \mathcal{F}, \mathcal{G})$, define the Hilbert-Schmidt Independence Criterion (HSIC): $$\mathrm{HSIC}^2(P_{XY}; \mathcal{F}, \mathcal{G}) = \sum_{i=1}^{\infty} \gamma_i^2.$$ HSIC is MMD with a product kernel! $$\mathrm{HSIC}^2(P_{XY}; \mathcal{F}, \mathcal{G}) = \mathrm{MMD}^2(P_{XY}, P_X P_Y; \mathcal{H}_\kappa)$$ where $\kappa((x,y),(x',y')) = k(x,x')\, l(y,y')$. AG, O. Bousquet, A. Smola, and B. Schölkopf, ALT 2005; AG, K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola, NIPS 2007. 29/52
  • 63. Asymptotics of HSIC under independence Given a sample $\{(x_i, y_i)\}_{i=1}^n \overset{i.i.d.}{\sim} P_{XY}$, what is the empirical HSIC? Empirical HSIC (biased): $$\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\,\mathrm{trace}(\tilde{K}\tilde{L}),$$ where $K_{ij} = k(x_i, x_j)$ and $L_{ij} = l(y_i, y_j)$ ($\tilde{K}$ and $\tilde{L}$ computed with empirically centered features). Statistical testing: given $P_{XY} = P_X P_Y$, what is the threshold $c_\alpha$ such that $P(\widehat{\mathrm{HSIC}} > c_\alpha) \le \alpha$ for small $\alpha$? Asymptotics of $\widehat{\mathrm{HSIC}}$ when $P_{XY} = P_X P_Y$: $$n\,\widehat{\mathrm{HSIC}} \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l z_l^2, \qquad z_l \sim \mathcal{N}(0,1) \text{ i.i.d.},$$ where $\lambda_l \psi_l(z_j) = \int h_{ijqr}\, \psi_l(z_i)\, dF_{i,q,r}$ and $h_{ijqr} = \frac{1}{4!}\sum^{(i,j,q,r)}_{(t,u,v,w)} \left( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} \right)$. 30/52
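The biased estimator $\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\mathrm{trace}(\tilde K \tilde L)$ is a few lines of NumPy. A minimal sketch, assuming Gaussian kernels and using the centering matrix $H = I - \mathbf{1}\mathbf{1}^\top/n$ so that $\tilde K = HKH$ (the names `rbf` and `hsic_biased` are illustrative):

```python
import numpy as np

def rbf(X, sigma=1.0):
    # Gaussian kernel matrix on the rows of X
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_biased(X, Y, sigma=1.0):
    """Biased empirical HSIC = trace(K~ L~) / n^2, K~ = HKH, L~ = HLH."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ rbf(X, sigma) @ H
    Lc = H @ rbf(Y, sigma) @ H
    return np.trace(Kc @ Lc) / n**2
```

Since $\tilde K$ and $\tilde L$ are positive semi-definite, the statistic is always non-negative, and it is larger for dependent pairs than for independent ones at the same sample size.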
  • 67. A statistical test Given $P_{XY} = P_X P_Y$, what is the threshold $c_\alpha$ such that $P(\widehat{\mathrm{HSIC}} > c_\alpha) \le \alpha$ for small $\alpha$ (prob. of false positive)? Original time series: $(X_1, Y_1), (X_2, Y_2), \ldots, (X_{10}, Y_{10})$. Permutation: $(X_1, Y_7), (X_2, Y_3), (X_3, Y_9), (X_4, Y_2), \ldots$ Null distribution via permutation: compute HSIC for $\{x_i, y_{\sigma(i)}\}_{i=1}^n$ for a random permutation $\sigma$ of the indices $\{1, \ldots, n\}$; this gives HSIC for independent variables. Repeat for many different permutations to get an empirical CDF. The threshold $c_\alpha$ is the $1 - \alpha$ quantile of the empirical CDF. 31/52
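The permutation procedure above can be sketched directly. Note that centering commutes with permutation ($PHP^\top = H$), so it is enough to permute the rows and columns of the centered matrix $\tilde L$. As before, the kernel choice and function names are assumptions of this sketch:

```python
import numpy as np

def rbf(X, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_test(X, Y, alpha=0.05, n_perms=200, sigma=1.0, seed=0):
    """Permutation test for independence: returns (statistic, threshold, reject)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ rbf(X, sigma) @ H
    Lc = H @ rbf(Y, sigma) @ H
    stat = np.trace(Kc @ Lc) / n**2
    rng = np.random.default_rng(seed)
    null = np.empty(n_perms)
    for b in range(n_perms):
        p = rng.permutation(n)                        # breaks the X-Y pairing
        null[b] = np.trace(Kc @ Lc[np.ix_(p, p)]) / n**2
    c_alpha = np.quantile(null, 1 - alpha)            # (1 - alpha) quantile of null CDF
    return stat, c_alpha, stat > c_alpha
```

On strongly dependent data the statistic lands far above the permutation threshold, so the test rejects the null of independence.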
  • 70. Application: dependence detection across languages Testing task: detect dependence between English and French text. Example aligned pairs: “Honourable senators, I have a question for the Leader of the Government in the Senate” / “Honorables sénateurs, ma question s’adresse au leader du gouvernement au Sénat”; “No doubt there is great pressure on provincial and municipal governments” / “Les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions”; “In fact, we have increased federal investments for early childhood development” / “Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes”. Text from the aligned Hansards of the 36th parliament of Canada, https://www.isi.edu/natural-language/download/hansard/ 32/52
  • 71. Application: dependence detection across languages Testing task: detect dependence between English and French text. $k$-spectrum kernel with $k = 10$, sample size $n = 10$. $\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\,\mathrm{trace}(\tilde{K}\tilde{L})$ ($\tilde{K}$ and $\tilde{L}$ column centered). 33/52
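A $k$-spectrum kernel counts shared length-$k$ substrings: $k(s,t) = \sum_u \mathrm{count}_s(u)\,\mathrm{count}_t(u)$ over all strings $u$ of length $k$. A minimal sketch (not the implementation used for the slide's experiments; `k_spectrum` is an illustrative name):

```python
from collections import Counter

def k_spectrum(s, t, k=10):
    """k-spectrum kernel: inner product of substring-count feature vectors.
    Counts every length-k substring of s and of t, then takes the
    dot product of the two count vectors."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(c * ct[u] for u, c in cs.items())
```

For example, with `k=3`, `"abcabc"` contains the substring `"abc"` twice, so its kernel value against `"abc"` is 2, while strings sharing no length-3 substring score 0.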
  • 72. Application: dependence detection across languages Results (for $\alpha = 0.05$): $k$-spectrum kernel: average Type II error 0. Bag-of-words kernel: average Type II error 0.18. Settings: five-line extracts, averaged over 300 repetitions, for the “Agriculture” transcripts. Similar results for the Fisheries and Immigration transcripts. 34/52
  • 73. Testing higher order interactions 35/52
  • 76. Detecting higher order interaction How to detect V-structures with pairwise weak individual dependence? $X \perp\!\!\!\perp Y$, $Y \perp\!\!\!\perp Z$, $X \perp\!\!\!\perp Z$. [Figure: scatter plots of $X_1$ vs $Y_1$, $Y_1$ vs $Z_1$, $X_1$ vs $Z_1$, and $X_1 Y_1$ vs $Z_1$; only the last shows dependence.] $X, Y \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, $\ Z \mid X, Y \sim \mathrm{sign}(XY)\,\mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right)$. Fine print: faithfulness is violated here! 36/52
  • 77. V-structure discovery Assume $X \perp\!\!\!\perp Y$ has been established. A V-structure can then be detected by: a consistent CI test, $H_0: X \perp\!\!\!\perp Y \mid Z$ [Fukumizu et al. 2008, Zhang et al. 2011]; or a factorisation test, $H_0: (X,Y) \perp\!\!\!\perp Z \ \vee\ (X,Z) \perp\!\!\!\perp Y \ \vee\ (Y,Z) \perp\!\!\!\perp X$ (multiple standard two-variable tests). How well do these work? 37/52
  • 78. Detecting higher order interaction Generalise the earlier example to $p$ dimensions: $X \perp\!\!\!\perp Y$, $Y \perp\!\!\!\perp Z$, $X \perp\!\!\!\perp Z$, with $X_1, Y_1 \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, $\ Z_1 \mid X, Y \sim \mathrm{sign}(X_1 Y_1)\,\mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right)$, and $X_{2:p}, Y_{2:p}, Z_{2:p} \overset{i.i.d.}{\sim} \mathcal{N}(0, I_{p-1})$. Fine print: faithfulness is violated here! 38/52
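The construction above is easy to sample. A sketch under one stated assumption: the $\mathrm{Exp}(1/\sqrt{2})$ parameter is read here as the scale of the exponential (it could equally denote the rate); the function name is illustrative. By construction $Z_1$ is positive exactly when $X_1$ and $Y_1$ share a sign, so all pairwise correlations vanish while the triple interacts jointly:

```python
import numpy as np

def v_structure_sample(n, seed=0):
    """X, Y ~ N(0,1) i.i.d.; Z | X, Y = sign(X*Y) * Exp(scale = 1/sqrt(2)).
    Each pair (X,Z), (Y,Z), (X,Y) is uncorrelated, but sign(Z) = sign(X*Y),
    so the three variables have a purely third-order interaction."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n)
    Y = rng.normal(size=n)
    Z = np.sign(X * Y) * rng.exponential(scale=1 / np.sqrt(2), size=n)
    return X, Y, Z
```

A quick sanity check: the correlation of $Z$ with $X$ or $Y$ alone is near zero, while the correlation of $Z$ with the product $XY$ is clearly positive.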
  • 79. V-structure discovery CI test for $X \perp\!\!\!\perp Y \mid Z$ from Zhang et al. (2011), and a factorisation test, $n = 500$. 39/52
  • 82. Lancaster interaction measure The Lancaster interaction measure of $(X_1, \ldots, X_D) \sim P$ is a signed measure $\Delta_L P$ that vanishes whenever $P$ can be factorised non-trivially. $D = 2$: $\Delta_L P = P_{XY} - P_X P_Y$. $D = 3$: $\Delta_L P = P_{XYZ} - P_X P_{YZ} - P_Y P_{XZ} - P_Z P_{XY} + 2 P_X P_Y P_Z$. [Figure: the five terms of $\Delta_L P$ drawn as graphs over $X$, $Y$, $Z$.] 40/52
  • 83. Lancaster interaction measure Case of $X \perp\!\!\!\perp (Y, Z)$: substituting $P_{XYZ} = P_X P_{YZ}$ (and hence $P_{XY} = P_X P_Y$, $P_{XZ} = P_X P_Z$) into the $D = 3$ expression gives $\Delta_L P = 0$. 40/52
  • 84. Lancaster interaction measure $(X,Y) \perp\!\!\!\perp Z \ \vee\ (X,Z) \perp\!\!\!\perp Y \ \vee\ (Y,Z) \perp\!\!\!\perp X \ \Rightarrow\ \Delta_L P = 0$. ...so what might be missed? 40/52
  • 85. Lancaster interaction measure The converse fails: $\Delta_L P = 0 \ \nRightarrow\ (X,Y) \perp\!\!\!\perp Z \ \vee\ (X,Z) \perp\!\!\!\perp Y \ \vee\ (Y,Z) \perp\!\!\!\perp X$. Example: $P(0,0,0) = 0.2$, $P(0,0,1) = 0.1$, $P(1,0,0) = 0.1$, $P(1,0,1) = 0.1$, $P(0,1,0) = 0.1$, $P(0,1,1) = 0.1$, $P(1,1,0) = 0.1$, $P(1,1,1) = 0.2$. 40/52
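The counterexample can be verified numerically: the pmf above has $\Delta_L P = 0$ at every point, yet none of the three two-against-one factorisations holds. A short check with NumPy broadcasting (array layout `P[x, y, z]` is this sketch's convention):

```python
import numpy as np

# Joint pmf from the slide, indexed P[x, y, z] for binary X, Y, Z
P = np.array([[[0.2, 0.1],    # P(0,0,0), P(0,0,1)
               [0.1, 0.1]],   # P(0,1,0), P(0,1,1)
              [[0.1, 0.1],    # P(1,0,0), P(1,0,1)
               [0.1, 0.2]]])  # P(1,1,0), P(1,1,1)

Px, Py, Pz = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))   # single marginals
Pxy, Pxz, Pyz = P.sum(2), P.sum(1), P.sum(0)               # pair marginals

# Delta_L P = P_XYZ - P_X P_YZ - P_Y P_XZ - P_Z P_XY + 2 P_X P_Y P_Z
delta_L = (P
           - Px[:, None, None] * Pyz[None, :, :]
           - Py[None, :, None] * Pxz[:, None, :]
           - Pz[None, None, :] * Pxy[:, :, None]
           + 2 * Px[:, None, None] * Py[None, :, None] * Pz[None, None, :])
```

Here `delta_L` is identically zero, while e.g. $P_{XY}(0,0)\,P_Z(0) = 0.3 \cdot 0.5 = 0.15 \ne 0.2 = P(0,0,0)$, so $(X,Y) \not\perp\!\!\!\perp Z$.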
  • 86. A kernel test statistic using the Lancaster measure Construct a test by estimating $\|\mu_\kappa(\Delta_L P)\|^2_{\mathcal{H}_\kappa}$, where $\kappa = k \otimes l \otimes m$: $$\|\mu(P_{XYZ} - P_{XY} P_Z - \cdots)\|^2_{\mathcal{H}_\kappa} = \langle \mu P_{XYZ}, \mu P_{XYZ} \rangle_{\mathcal{H}_\kappa} - 2\,\langle \mu P_{XYZ}, \mu(P_{XY} P_Z) \rangle_{\mathcal{H}_\kappa} + \cdots$$ 41/52
  • 87. A kernel test statistic using the Lancaster measure Table: $V$-statistic estimators of $\langle \cdot, \cdot \rangle_{\mathcal{H}_\kappa}$ (without the terms $P_X P_Y P_Z$). $H$ is the centering matrix $I - n^{-1}\mathbf{1}\mathbf{1}^\top$. Lancaster interaction statistic: $$\widehat{\|\mu(\Delta_L P)\|}^2_{\mathcal{H}_\kappa} = \frac{1}{n^2}\left((HKH) \circ (HLH) \circ (HMH)\right)_{++},$$ i.e. the elementwise product of the three centered Gram matrices, summed over all entries: the empirical joint central moment in the feature space. D. Sejdinovic, AG, W. Bergsma, NIPS 2013. 42/52
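The Lancaster statistic is one line once the centered Gram matrices are in hand: elementwise-multiply the three matrices, sum all entries, divide by $n^2$. A sketch assuming Gaussian kernels on all three variables (illustrative names):

```python
import numpy as np

def rbf(x, sigma=1.0):
    X = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def lancaster_stat(X, Y, Z, sigma=1.0):
    """Empirical ||mu(Delta_L P)||^2: (1/n^2) * grand sum of the
    elementwise product of the three centered Gram matrices."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ rbf(X, sigma) @ H
    L = H @ rbf(Y, sigma) @ H
    M = H @ rbf(Z, sigma) @ H
    return np.sum(K * L * M) / n**2
```

On the sign($XY$) construction from the earlier slides, the statistic is noticeably larger than on a fully independent triple of the same size, which is exactly the regime where pairwise tests fail.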
  • 89. V-structure discovery Lancaster test, CI test for X cc Y jZ from Zhang et al. (2011), and a factorisation test, n a 500 43/52
  • 90. Interaction for $D \ge 4$ An interaction measure valid for all $D$ (Streitberg, 1990): $$\Delta_S P = \sum_{\pi} (-1)^{|\pi|-1}\,(|\pi|-1)!\ J_\pi P.$$ For a partition $\pi$, $J_\pi$ associates to the joint the corresponding factorisation, e.g. $J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}$. 44/52
  • 92. Interaction for $D \ge 4$ The sum in $\Delta_S P$ runs over all partitions of $\{1, \ldots, D\}$, whose number grows as the Bell numbers: [Figure: Bell number growth, reaching beyond $10^{18}$ by $D = 25$.] 44/52
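The Bell-number growth quoted above is cheap to reproduce with the Bell triangle recurrence (each row starts with the last entry of the previous row, and each subsequent entry adds the entry above); the function name is illustrative:

```python
def bell_numbers(D):
    """Bell numbers B_1..B_D via the Bell triangle.
    B_d counts the partitions of {1, ..., d}, i.e. the number of terms
    in Streitberg's interaction measure for d variables."""
    row, bells = [1], [1]                 # B_1 = 1
    for _ in range(D - 1):
        nxt = [row[-1]]                   # new row starts with last entry
        for v in row:
            nxt.append(nxt[-1] + v)       # each entry adds the one above
        row = nxt
        bells.append(row[-1])             # B_{d+1} is the row's last entry
    return bells
```

For example, `bell_numbers(6)` gives `[1, 2, 5, 15, 52, 203]`, and $B_{25} \approx 4.6 \times 10^{18}$, matching the explosion shown on the slide.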
  • 93. Part 4: Advanced topics 45/52
  • 94. Advanced topics: testing on time series; testing for conditional dependence; regression and the conditional mean embedding. 46/52
  • 99. Measures of divergence Sriperumbudur, Fukumizu, Gretton, Schölkopf, Lanckriet (2012) 51/52
  • 100. Co-authors From Gatsby: Kacper Chwialkowski, Wittawat Jitkrittum, Bharath Sriperumbudur, Heiko Strathmann, Dougal Sutherland, Zoltan Szabo, Wenkai Xu. External collaborators: Kenji Fukumizu, Bernhard Schölkopf, Alex Smola. Questions? 52/52