Representing and comparing probabilities: Part 2
Arthur Gretton
Gatsby Computational Neuroscience Unit,
University College London
UAI, 2017
Testing against a probabilistic model
Statistical model criticism

MMD(P, Q) = \|f^*\|_F = \sup_{\|f\|_F \le 1} [ E_Q f - E_P f ]

[Plot: densities p(x) and q(x), with the MMD witness f*(x)]

f*(x) is the witness function.

Can we compute MMD with samples from Q and a model P?
Problem: we usually can't compute E_P f in closed form.
Stein idea

To get rid of E_P f in

\sup_{\|f\|_F \le 1} [ E_Q f - E_P f ],

we define the Stein operator

[T_p f](x) = \frac{1}{p(x)} \frac{d}{dx} ( f(x) p(x) )

Then

E_P [T_P f] = 0,

subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016)
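The identity E_P[T_P f] = 0 can be checked numerically; a minimal sketch for a standard normal p, where d/dx log p(x) = -x, and the illustrative test function f(x) = sin(x):

```python
import numpy as np

# Check E_p[T_p f] = 0 for p = N(0, 1) and f(x) = sin(x).
# For this p, d/dx log p(x) = -x, so [T_p f](x) = f'(x) - x f(x).
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
stein_f = np.cos(x) - x * np.sin(x)      # [T_p f](x) with f = sin
integral = np.sum(stein_f * p) * dx      # Riemann sum for E_p[T_p f]
print(abs(integral))                     # vanishes up to quadrature error
```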
Stein idea: proof

E_p [T_p f]
  = \int \left( \frac{1}{p(x)} \frac{d}{dx} ( f(x) p(x) ) \right) p(x) \, dx
  = \int \frac{d}{dx} ( f(x) p(x) ) \, dx
  = [ f(x) p(x) ]_{-\infty}^{\infty}
  = 0
Kernel Stein Discrepancy

Stein operator

T_p f = \partial_x f + f \, \partial_x (\log p)

Kernel Stein Discrepancy (KSD)

KSD(p, q, F) = \sup_{\|g\|_F \le 1} [ E_q T_p g - E_p T_p g ]
             = \sup_{\|g\|_F \le 1} E_q T_p g,

since E_p T_p g = 0.
[Plots: densities p(x) and q(x), with the Stein witness g*(x)]
Kernel Stein Discrepancy

Closed-form expression for KSD: given Z, Z' i.i.d. ~ q, then
(Chwialkowski, Strathmann, G., ICML 2016; Liu, Lee, Jordan, ICML 2016)

KSD^2(p, q, F) = E_q h_p(Z, Z')

where

h_p(x, y) := \partial_x \log p(x) \, \partial_y \log p(y) \, k(x, y)
           + \partial_y \log p(y) \, \partial_x k(x, y)
           + \partial_x \log p(x) \, \partial_y k(x, y)
           + \partial_x \partial_y k(x, y)

and k is the RKHS kernel for F.

The statistic only depends on the kernel and \partial_x \log p(x): we do not
need to normalize p, or to sample from it.
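As a sketch of this estimator (the Gaussian kernel, its unit bandwidth, and the V-statistic form are illustrative assumptions, not prescribed by the slide), the closed form can be evaluated for a 1-D Gaussian model p = N(0, 1):

```python
import numpy as np

# V-statistic estimate of KSD^2 for p = N(0, 1) with a Gaussian kernel.
def ksd2_v(z, score, sigma=1.0):
    """Mean of h_p(z_i, z_j) over all pairs of sample points."""
    x, y = z[:, None], z[None, :]
    sx, sy = score(x), score(y)          # d/dx log p at the sample points
    d = x - y
    k = np.exp(-d**2 / (2 * sigma**2))   # Gaussian kernel
    dkx = -d / sigma**2 * k              # d/dx k(x, y)
    dky = d / sigma**2 * k               # d/dy k(x, y)
    dkxy = (1 / sigma**2 - d**2 / sigma**4) * k  # d2/dxdy k(x, y)
    h = sx * sy * k + sy * dkx + sx * dky + dkxy
    return h.mean()

score = lambda x: -x                     # d/dx log p for p = N(0, 1)
rng = np.random.default_rng(0)
same = ksd2_v(rng.normal(0.0, 1.0, 500), score)     # q = p
shifted = ksd2_v(rng.normal(1.0, 1.0, 500), score)  # q shifted away from p
print(same, shifted)   # the shifted sample gives the larger discrepancy
```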
Statistical model criticism

Example: Chicago crime data.

[Maps: model as a Gaussian mixture with two components, and the
corresponding Stein witness function]

[Maps: model as a Gaussian mixture with ten components, and the
corresponding Stein witness function]

Code: https://github.com/karlnapf/kernel_goodness_of_fit
Kernel Stein Discrepancy

Further applications: evaluation of approximate MCMC methods.
(Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017)

What kernel to use? The inverse multiquadric kernel,

k(x, y) = ( c + \|x - y\|_2^2 )^\beta   for \beta \in (-1, 0).
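A minimal sketch of this kernel (the defaults c = 1 and beta = -1/2 are illustrative choices within the stated range):

```python
import numpy as np

# Inverse multiquadric kernel k(x, y) = (c + ||x - y||^2)^beta, beta in (-1, 0),
# evaluated between two batches of vectors.
def imq_kernel(X, Y, c=1.0, beta=-0.5):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return (c + sq_dists) ** beta

K = imq_kernel(np.zeros((3, 2)), np.ones((4, 2)))
print(K.shape)   # (3, 4); each entry equals (1 + 2) ** -0.5
```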
Testing statistical dependence
Dependence testing

Given: samples from a distribution P_XY
Goal: are X and Y independent?

Example: X = dog images, Y = captions.
"Their noses guide them through life, and they're never happier than when
following an interesting scent."
"A large animal who slings slobber, exudes a distinctive houndy odor, and
wants nothing more than to follow his nose."
"A responsive, interactive pet, one that will blow in your ear and follow
you everywhere."
Text from dogtime.com and petfinder.com
MMD as a dependence measure?

Could we use MMD?

MMD( P_XY, P_X P_Y ; H_\kappa ),   with P := P_XY and Q := P_X P_Y

We don't have samples from Q := P_X P_Y, only pairs
{(x_i, y_i)}_{i=1}^n  i.i.d. ~ P_XY.

Solution: simulate Q with the pairs (x_i, y_j) for j ≠ i.

What kernel \kappa should we use for the RKHS H_\kappa?
MMD as a dependence measure

Kernel k on images with feature space F,
kernel l on captions with feature space G.
Kernel \kappa on image-text pairs: are images and captions similar?
MMD as a dependence measure

Given: samples from a distribution P_XY
Goal: are X and Y independent?

MMD^2( \hat P_XY, \hat P_X \hat P_Y ; H_\kappa ) := \frac{1}{n^2} trace(K L)

(K, L column centered)
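The trace statistic can be sketched in a few lines; Gaussian kernels with unit bandwidth are an illustrative choice, and centering is done with the matrix H = I - (1/n) 1 1^T:

```python
import numpy as np

# Biased statistic MMD^2 = (1/n^2) trace(Kc Lc) with doubly centered
# Gram matrices Kc = HKH and Lc = HLH.
def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def hsic_biased(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(H @ gram(x) @ H @ gram(y) @ H) / n**2

rng = np.random.default_rng(1)
x = rng.normal(size=300)
dep = hsic_biased(x, x + 0.1 * rng.normal(size=300))  # strongly dependent pair
ind = hsic_biased(x, rng.normal(size=300))            # independent pair
print(dep, ind)   # the dependent pair gives the larger statistic
```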
MMD as a dependence measure

Two questions:

Why the product kernel? There are many ways to combine kernels: why not,
e.g., a sum?

Is there a more interpretable way of defining this dependence measure?
Finding covariance with smooth transformations

Illustration: two variables with no correlation but strong dependence.

[Plots: raw data (Correlation: 0.00); smooth transformations f(x) and g(y);
transformed data (Correlation: 0.90)]
Define two spaces, one for each witness

Function in F:
f(x) = \sum_{j=1}^\infty f_j \varphi_j(x)
Feature map \varphi(x); kernel for RKHS F on the domain X:
k(x, x') = \langle \varphi(x), \varphi(x') \rangle_F

Function in G:
g(y) = \sum_{j=1}^\infty g_j \psi_j(y)
Feature map \psi(y); kernel for RKHS G on the domain Y:
l(y, y') = \langle \psi(y), \psi(y') \rangle_G
The constrained covariance

The constrained covariance is

COCO(P_XY) = \sup_{\|f\|_F \le 1, \, \|g\|_G \le 1} cov[ f(x), g(y) ]

[Plots: uncorrelated but dependent data (Correlation: 0.00); witness
functions f(x) and g(y); transformed data (Correlation: 0.90)]
The constrained covariance

Expanding f and g in the feature coordinates,

COCO(P_XY) = \sup_{\|f\|_F \le 1, \, \|g\|_G \le 1}
    E_{xy} \left[ \left( \sum_{j=1}^\infty f_j \varphi_j(x) \right)
                  \left( \sum_{j=1}^\infty g_j \psi_j(y) \right) \right]

Fine print: the feature mappings \varphi(x) and \psi(y) are assumed to have
zero mean, so this expectation equals the covariance.

Rewriting:

E_{xy} [ f(x) g(y) ]
  = \begin{bmatrix} f_1 & f_2 & \cdots \end{bmatrix}
    \underbrace{ E_{xy} \left( \begin{bmatrix} \varphi_1(x) \\ \varphi_2(x) \\ \vdots \end{bmatrix}
    \begin{bmatrix} \psi_1(y) & \psi_2(y) & \cdots \end{bmatrix} \right) }_{C_{\varphi(x)\psi(y)}}
    \begin{bmatrix} g_1 \\ g_2 \\ \vdots \end{bmatrix}

COCO is the maximum singular value of the feature covariance C_{\varphi(x)\psi(y)}.
Computing COCO in practice

Given a sample {(x_i, y_i)}_{i=1}^n i.i.d. ~ P_XY, what is the empirical COCO?

COCO is the largest eigenvalue \gamma_max of the generalized eigenvalue problem

\begin{bmatrix} 0 & \frac{1}{n} K L \\ \frac{1}{n} L K & 0 \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
= \gamma
\begin{bmatrix} K & 0 \\ 0 & L \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix},

where K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j).

Witness functions (singular vectors):

f(x) \propto \sum_{i=1}^n \alpha_i k(x_i, x),    g(y) \propto \sum_{i=1}^n \beta_i l(y_i, y)

Fine print: kernels are computed with empirically centered features
\varphi(x) - \frac{1}{n} \sum_{i=1}^n \varphi(x_i) and \psi(y) - \frac{1}{n} \sum_{i=1}^n \psi(y_i).

A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath,
Y. Murayama, J. Pauls, B. Schoelkopf, and N. Logothetis, AISTATS 2005
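Equivalently (a sketch assuming Gaussian kernels with unit bandwidth), the generalized eigenvalue problem reduces to COCO = (1/n) sqrt(lambda_max(Kc Lc)) for the doubly centered Gram matrices Kc, Lc:

```python
import numpy as np

# Empirical COCO via the centered Gram matrices.
def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def coco(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ gram(x) @ H, H @ gram(y) @ H
    # Kc @ Lc has real nonnegative eigenvalues (it is similar to the
    # symmetric PSD matrix Kc^{1/2} Lc Kc^{1/2}).
    lam = np.max(np.real(np.linalg.eigvals(Kc @ Lc)))
    return np.sqrt(max(lam, 0.0)) / n

rng = np.random.default_rng(2)
t = rng.uniform(-np.pi, np.pi, 500)
dep = coco(np.cos(t), np.sin(t))                        # dependent, uncorrelated
ind = coco(rng.normal(size=500), rng.normal(size=500))  # independent
print(dep, ind)   # the dependent pair gives the larger COCO
```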
What is a large dependence with COCO?

[Plots: a smooth density and 500 samples from it; a rough density and 500
samples from it]

The density takes the form

P_XY \propto 1 + \sin(\omega x) \sin(\omega y)

Which of these is the more "dependent"?
Finding covariance with smooth transformations

Case of \omega = 1:

[Plots: data (Correlation: 0.31); witness functions f, g; transformed data
(Correlation: 0.50, COCO: 0.09)]
Finding covariance with smooth transformations

Case of \omega = 2:

[Plots: data (Correlation: 0.02); witness functions f, g; transformed data
(Correlation: 0.54, COCO: 0.07)]
Finding covariance with smooth transformations

Case of \omega = 3:

[Plots: data (Correlation: 0.03); witness functions f, g; transformed data
(Correlation: 0.44, COCO: 0.04)]
Finding covariance with smooth transformations

Case of \omega = 4:

[Plots: data (Correlation: 0.05); witness functions f, g; transformed data
(Correlation: 0.25, COCO: 0.02)]
Finding covariance with smooth transformations

Case of \omega = 0: uniform noise! (shows bias)

[Plots: data (Correlation: 0.01); witness functions f, g; transformed data
(Correlation: 0.14, COCO: 0.02)]
Dependence is largest at "low" frequencies

As the dependence is encoded at higher frequencies, the smooth mappings f, g
achieve lower linear dependence.

Even for independent variables, COCO will not be zero at finite sample
sizes, since f and g will find some mild linear dependence (bias).

This bias decreases with increasing sample size.
Can we do better than COCO?

A second example with zero correlation.

First singular value of the feature covariance C_{\varphi(x)\psi(y)}:
[Plots: data (Correlation: 0.00); first pair of witness functions;
transformed data (Correlation: 0.80, COCO_1: 0.11)]

Second singular value of the feature covariance C_{\varphi(x)\psi(y)}:
[Plots: second pair of witness functions; transformed data
(Correlation: 0.37, COCO_2: 0.06)]
The Hilbert-Schmidt Independence Criterion

Writing the ith singular value of the feature covariance C_{\varphi(x)\psi(y)} as

\gamma_i := COCO_i(P_XY; F, G),

define the Hilbert-Schmidt Independence Criterion (HSIC)

HSIC^2(P_XY; F, G) = \sum_{i=1}^\infty \gamma_i^2.

HSIC is MMD with a product kernel!

HSIC^2(P_XY; F, G) = MMD^2(P_XY, P_X P_Y; H_\kappa)

where \kappa( (x, y), (x', y') ) = k(x, x') l(y, y').

A. Gretton, O. Bousquet, A. Smola, and B. Schoelkopf, ALT 2005; A. Gretton,
K. Fukumizu, C.H. Teo, L. Song, B. Schoelkopf, and A. Smola, NIPS 2007
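As a quick numerical sketch (Gaussian kernels with unit bandwidth assumed) that the sum of squared empirical singular values matches the trace statistic:

```python
import numpy as np

# The empirical COCO_i^2 are the eigenvalues of Kc Lc divided by n^2, so
# their sum equals (1/n^2) trace(Kc Lc), the empirical HSIC^2.
def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = x + rng.normal(size=n)
H = np.eye(n) - np.ones((n, n)) / n
Kc, Lc = H @ gram(x) @ H, H @ gram(y) @ H

gammas2 = np.real(np.linalg.eigvals(Kc @ Lc)) / n**2  # COCO_i^2
hsic2 = np.trace(Kc @ Lc) / n**2
print(abs(gammas2.sum() - hsic2))  # agree to machine precision
```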
Asymptotics of HSIC under independence

Given a sample {(x_i, y_i)}_{i=1}^n i.i.d. ~ P_XY, what is the empirical HSIC?

Empirical HSIC (biased):

HSIC = \frac{1}{n^2} trace(K L)

K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j) (K and L computed with
empirically centered features).

Statistical testing: given P_XY = P_X P_Y, what is the threshold c_\alpha
such that P(HSIC > c_\alpha) < \alpha for small \alpha?

Asymptotics of HSIC when P_XY = P_X P_Y:

n \, HSIC \xrightarrow{D} \sum_{l=1}^\infty \lambda_l z_l^2,   z_l ~ N(0, 1) i.i.d.,

where \lambda_l \psi_l(z_j) = \int h_{ijqr} \psi_l(z_i) \, dF_{i,q,r},

h_{ijqr} = \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} ( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} )
A statistical test

Given P_XY = P_X P_Y, what is the threshold c_\alpha such that
P(HSIC > c_\alpha) < \alpha for small \alpha (prob. of false positive)?

Original time series:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10

Permutation:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10

Null distribution via permutation:
Compute HSIC for {x_i, y_{\sigma(i)}}_{i=1}^n for a random permutation \sigma
of the indices {1, ..., n}. This gives HSIC for independent variables.
Repeat for many different permutations to get an empirical CDF.
The threshold c_\alpha is the 1 - \alpha quantile of the empirical CDF.
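The permutation procedure above can be sketched as follows, built on the biased HSIC statistic; Gaussian kernels with unit bandwidth and 200 permutations are illustrative choices:

```python
import numpy as np

def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def hsic(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ gram(x) @ H @ gram(y) @ H) / n**2

def hsic_test(x, y, alpha=0.05, n_perms=200, seed=0):
    rng = np.random.default_rng(seed)
    stat = hsic(x, y)
    # HSIC under random pairings simulates the null P_XY = P_X P_Y.
    null = np.array([hsic(x, y[rng.permutation(len(y))])
                     for _ in range(n_perms)])
    c_alpha = np.quantile(null, 1 - alpha)  # 1 - alpha quantile of null CDF
    return stat, c_alpha, stat > c_alpha

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = np.sin(3 * x) + 0.3 * rng.normal(size=200)   # strongly dependent
stat, c_alpha, reject = hsic_test(x, y)
print(reject)   # dependence detected
```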
Application: dependence detection across languages

Testing task: detect dependence between English and French text.

X (English):
"Honourable senators, I have a question for the Leader of the Government in
the Senate"
"No doubt there is great pressure on provincial and municipal governments"
"In fact, we have increased federal investments for early childhood
development."

Y (French):
"Honorables sénateurs, ma question s'adresse au leader du gouvernement au
Sénat"
"Les ordres de gouvernements provinciaux et municipaux subissent de fortes
pressions"
"Au contraire, nous avons augmenté le financement fédéral pour le
développement des jeunes"

Text from the aligned Hansards of the 36th parliament of Canada,
https://www.isi.edu/natural-language/download/hansard/
Application: dependence detection across languages

Testing task: detect dependence between English and French text.

k-spectrum kernel, k = 10, sample size n = 10

HSIC = \frac{1}{n^2} trace(K L)

(K and L column centered)
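A minimal sketch of the k-spectrum kernel (the dot product of length-k substring counts; k = 3 here for readability, while the slide uses k = 10):

```python
from collections import Counter

# k-spectrum kernel: count all length-k substrings of each string, then
# take the dot product of the two count vectors.
def spectrum(s, k=3):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    cs, ct = spectrum(s, k), spectrum(t, k)
    return sum(cs[sub] * ct[sub] for sub in cs)

print(spectrum_kernel("senators", "senate"))  # 3: shared "sen", "ena", "nat"
```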
Application: dependence detection across languages

Results (for \alpha = 0.05):
k-spectrum kernel: average Type II error 0
Bag-of-words kernel: average Type II error 0.18

Settings: five-line extracts, averaged over 300 repetitions, for the
"Agriculture" transcripts. Similar results for the Fisheries and
Immigration transcripts.
Testing higher order interactions
Detecting higher order interaction

How to detect V-structures with pairwise weak individual dependence?

[Graph: V-structure X -> Z <- Y]

Pairwise, X ⊥⊥ Y, Y ⊥⊥ Z, and X ⊥⊥ Z, where

X, Y i.i.d. ~ N(0, 1)
Z | X, Y ~ sign(XY) Exp(1/\sqrt{2})

[Scatter plots: X1 vs Y1, Y1 vs Z1, and X1 vs Z1 show no dependence;
X1*Y1 vs Z1 reveals the interaction]

Fine print: faithfulness is violated here!
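Simulating this construction (reading Exp(1/sqrt(2)) as rate 1/sqrt(2), i.e. scale sqrt(2), an assumption about the slide's convention) shows the pairwise correlations vanish while X*Y predicts Z:

```python
import numpy as np

# X, Y ~ N(0, 1); Z has exponential magnitude and the sign of X*Y.
rng = np.random.default_rng(5)
n = 20000
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = np.sign(X * Y) * rng.exponential(scale=np.sqrt(2), size=n)

print(np.corrcoef(X, Z)[0, 1])      # near 0: X and Z look independent
print(np.corrcoef(Y, Z)[0, 1])      # near 0
print(np.corrcoef(X * Y, Z)[0, 1])  # clearly positive: three-way interaction
```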
V-structure discovery

[Graph: V-structure X -> Z <- Y]

Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:

Consistent CI test: H0: X ⊥⊥ Y | Z [Fukumizu et al. 2008, Zhang et al. 2011]

Factorisation test: H0: (X, Y) ⊥⊥ Z or (X, Z) ⊥⊥ Y or (Y, Z) ⊥⊥ X
(multiple standard two-variable tests)

How well do these work?
Detecting higher order interaction

Generalise the earlier example to p dimensions:

X_1, Y_1 i.i.d. ~ N(0, 1),   Z_1 | X_1, Y_1 ~ sign(X_1 Y_1) Exp(1/\sqrt{2})
X_{2:p}, Y_{2:p}, Z_{2:p} i.i.d. ~ N(0, I_{p-1})

Pairwise, X ⊥⊥ Y, Y ⊥⊥ Z, and X ⊥⊥ Z.

Fine print: faithfulness is violated here!
V-structure discovery

[Results: CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), versus a
factorisation test, n = 500]
Lancaster interaction measure

The Lancaster interaction measure of (X_1, ..., X_D) ~ P is a signed measure
\Delta_L P that vanishes whenever P can be factorised non-trivially.

D = 2:  \Delta_L P = P_XY - P_X P_Y
D = 3:  \Delta_L P = P_XYZ - P_X P_YZ - P_Y P_XZ - P_Z P_XY + 2 P_X P_Y P_Z

Case of X ⊥⊥ (Y, Z): the terms cancel, and \Delta_L P = 0.
Lancaster interaction measure

(X, Y) ⊥⊥ Z or (X, Z) ⊥⊥ Y or (Y, Z) ⊥⊥ X  ⇒  \Delta_L P = 0.

...so what might be missed? The converse fails:

\Delta_L P = 0 does not imply (X, Y) ⊥⊥ Z or (X, Z) ⊥⊥ Y or (Y, Z) ⊥⊥ X.

Example:
P(0,0,0) = 0.2   P(0,0,1) = 0.1   P(1,0,0) = 0.1   P(1,0,1) = 0.1
P(0,1,0) = 0.1   P(0,1,1) = 0.1   P(1,1,0) = 0.1   P(1,1,1) = 0.2
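This example can be checked directly; a sketch:

```python
import numpy as np

# The Lancaster measure of this table vanishes, yet no two-vs-one
# factorisation holds.
P = np.zeros((2, 2, 2))
P[0, 0, 0], P[0, 0, 1], P[1, 0, 0], P[1, 0, 1] = 0.2, 0.1, 0.1, 0.1
P[0, 1, 0], P[0, 1, 1], P[1, 1, 0], P[1, 1, 1] = 0.1, 0.1, 0.1, 0.2

Px, Py, Pz = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
Pxy, Pxz, Pyz = P.sum(2), P.sum(1), P.sum(0)

lancaster = (P
             - Px[:, None, None] * Pyz[None, :, :]
             - Py[None, :, None] * Pxz[:, None, :]
             - Pz[None, None, :] * Pxy[:, :, None]
             + 2 * Px[:, None, None] * Py[None, :, None] * Pz[None, None, :])

print(np.abs(lancaster).max())                        # 0: Lancaster vanishes
print(np.abs(P - Pxy[:, :, None] * Pz[None, None, :]).max())  # (X,Y) not indep of Z
```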
A kernel test statistic using the Lancaster measure

Construct a test by estimating \| \mu_\kappa (\Delta_L P) \|^2_{H_\kappa},
where \kappa = k \otimes l \otimes m:

\| \mu_\kappa ( P_XYZ - P_XY P_Z - \cdots ) \|^2_{H_\kappa}
  = \langle P_XYZ, P_XYZ \rangle_{H_\kappa}
  - 2 \langle P_XYZ, P_XY P_Z \rangle_{H_\kappa} + \cdots
A kernel test statistic using the Lancaster measure

[Table: V-statistic estimators of the inner products \langle \cdot, \cdot \rangle_{H_\kappa}
(without the P_X P_Y P_Z terms)]

With H = I - n^{-1} 1 1^T the centering matrix, the terms combine into the
Lancaster interaction statistic (D. Sejdinovic, AG, W. Bergsma, NIPS 2013):

\| \mu_\kappa (\Delta_L \hat P) \|^2_{H_\kappa} = \frac{1}{n^2} ( HKH \circ HLH \circ HMH )_{++},

the empirical joint central moment in the feature space. Here \circ is the
entrywise product, and ( \cdot )_{++} sums all entries of the matrix.
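A sketch of this statistic for three scalar variables, assuming Gaussian kernels with unit bandwidth, evaluated on the earlier V-structure example against a fully independent triple:

```python
import numpy as np

# Empirical Lancaster statistic (1/n^2) (HKH o HLH o HMH)_{++}.
def gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def lancaster_stat(x, y, z):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L, M = (H @ gram(v) @ H for v in (x, y, z))
    return (K * L * M).sum() / n**2   # entrywise product, summed entries

rng = np.random.default_rng(6)
n = 500
X, Y = rng.normal(size=n), rng.normal(size=n)
Z = np.sign(X * Y) * rng.exponential(np.sqrt(2), size=n)  # V-structure
three_way = lancaster_stat(X, Y, Z)
null = lancaster_stat(X, Y, rng.normal(size=n))           # independent triple
print(three_way, null)   # the V-structure gives the larger value
```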
V-structure discovery

[Results: Lancaster test, CI test for X ⊥⊥ Y | Z from Zhang et al. (2011),
and a factorisation test, n = 500]
Interaction for D ≥ 4

Interaction measure valid for all D (Streitberg, 1990):

\Delta_S P = \sum_{\pi} (-1)^{|\pi| - 1} (|\pi| - 1)! \, J_\pi P

For a partition \pi, J_\pi associates to the joint the corresponding
factorisation, e.g., J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}.

The sum runs over all partitions \pi of {1, ..., D}, and the number of
partitions (the Bell numbers) grows very rapidly with D.

[Plot: number of partitions of {1, ..., D} for D = 1 to 25, on a log scale
reaching past 1e+19: Bell numbers growth]
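The Bell numbers counting those partitions can be generated with the Bell triangle recurrence, which makes the growth of the number of terms in Streitberg's measure concrete:

```python
# Bell numbers count the partitions of {1, ..., D}.
def bell_numbers(n):
    """Return [B_0, ..., B_n] via the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]              # each row starts with the last entry
        for v in row:                    # of the previous row
            new_row.append(new_row[-1] + v)
        bells.append(new_row[0])
        row = new_row
    return bells

print(bell_numbers(10))  # [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```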
Part 4: Advanced topics
Robust Image Denoising in RKHS via Orthogonal Matching PursuitRobust Image Denoising in RKHS via Orthogonal Matching Pursuit
Robust Image Denoising in RKHS via Orthogonal Matching Pursuit
Pantelis Bouboulis
 
Sep logic slide
Sep logic slideSep logic slide
Sep logic slide
rainoftime
 
Dcs unit 2
Dcs unit 2Dcs unit 2
Dcs unit 2
Anil Nigam
 

Similar to Representing and comparing probabilities: Part 2 (20)

Convolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernelsConvolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernels
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
 
Nec 602 unit ii Random Variables and Random process
Nec 602 unit ii Random Variables and Random processNec 602 unit ii Random Variables and Random process
Nec 602 unit ii Random Variables and Random process
 
Nonparametric testing for exogeneity with discrete regressors and instruments
Nonparametric testing for exogeneity with discrete regressors and instrumentsNonparametric testing for exogeneity with discrete regressors and instruments
Nonparametric testing for exogeneity with discrete regressors and instruments
 
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE AND LIE ALGEBRA
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE  AND LIE ALGEBRASYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE  AND LIE ALGEBRA
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE AND LIE ALGEBRA
 
[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
 
A Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description LogicsA Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description Logics
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference Compilation
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
 
Lesson 26
Lesson 26Lesson 26
Lesson 26
 
AI Lesson 26
AI Lesson 26AI Lesson 26
AI Lesson 26
 
A new integer programming model for hp problem
A new integer programming model for hp problemA new integer programming model for hp problem
A new integer programming model for hp problem
 
MarcoCeze_defense
MarcoCeze_defenseMarcoCeze_defense
MarcoCeze_defense
 
Optimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryOptimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-periphery
 
PAGOdA poster
PAGOdA posterPAGOdA poster
PAGOdA poster
 
Robust Image Denoising in RKHS via Orthogonal Matching Pursuit
Robust Image Denoising in RKHS via Orthogonal Matching PursuitRobust Image Denoising in RKHS via Orthogonal Matching Pursuit
Robust Image Denoising in RKHS via Orthogonal Matching Pursuit
 
Sep logic slide
Sep logic slideSep logic slide
Sep logic slide
 
Dcs unit 2
Dcs unit 2Dcs unit 2
Dcs unit 2
 

More from MLReview

Bayesian Non-parametric Models for Data Science using PyMC
 Bayesian Non-parametric Models for Data Science using PyMC Bayesian Non-parametric Models for Data Science using PyMC
Bayesian Non-parametric Models for Data Science using PyMC
MLReview
 
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
MLReview
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
MLReview
 
PixelGAN Autoencoders
  PixelGAN Autoencoders  PixelGAN Autoencoders
PixelGAN Autoencoders
MLReview
 
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
MLReview
 
Theoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryTheoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning Theory
MLReview
 
2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems
MLReview
 
Deep Learning for Semantic Composition
Deep Learning for Semantic CompositionDeep Learning for Semantic Composition
Deep Learning for Semantic Composition
MLReview
 
Near human performance in question answering?
Near human performance in question answering?Near human performance in question answering?
Near human performance in question answering?
MLReview
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial Networks
MLReview
 
Real-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridReal-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral Grid
MLReview
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
MLReview
 

More from MLReview (12)

Bayesian Non-parametric Models for Data Science using PyMC
 Bayesian Non-parametric Models for Data Science using PyMC Bayesian Non-parametric Models for Data Science using PyMC
Bayesian Non-parametric Models for Data Science using PyMC
 
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
 
PixelGAN Autoencoders
  PixelGAN Autoencoders  PixelGAN Autoencoders
PixelGAN Autoencoders
 
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 
Theoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryTheoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning Theory
 
2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems
 
Deep Learning for Semantic Composition
Deep Learning for Semantic CompositionDeep Learning for Semantic Composition
Deep Learning for Semantic Composition
 
Near human performance in question answering?
Near human performance in question answering?Near human performance in question answering?
Near human performance in question answering?
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial Networks
 
Real-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridReal-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral Grid
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
 

Recently uploaded

如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
frank0071
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 

Recently uploaded (20)

如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 

Representing and comparing probabilities: Part 2

  • 1. Representing and comparing probabilities: Part 2 Arthur Gretton Gatsby Computational Neuroscience Unit, University College London UAI, 2017 1/52
  • 2. Testing against a probabilistic model 2/52
  • 3. Statistical model criticism MMD(P, Q) = ‖f*‖ = sup_{‖f‖_F ≤ 1} [E_Q f − E_P f] [plot: p(x), q(x), and the witness f*(x)] f*(x) is the witness function. Can we compute MMD with samples from Q and a model P? Problem: we usually can't compute E_P f in closed form. 3/52
  • 4. Stein idea To get rid of E_p f in sup_{‖f‖_F ≤ 1} [E_q f − E_p f], we define the Stein operator [T_p f](x) = (1/p(x)) (d/dx)(f(x) p(x)). Then E_P T_P f = 0, subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016) 4/52
  • 5. Stein idea: proof E_p[T_p f] = ∫ [(1/p(x)) (d/dx)(f(x) p(x))] p(x) dx = ∫ (d/dx)(f(x) p(x)) dx = [f(x) p(x)]_{−∞}^{∞} = 0 5/52
  • 6.–9. Stein idea: proof (animation builds of the previous slide: the 1/p(x) factor cancels p(x)) E_p[T_p f] = ∫ (d/dx)(f(x) p(x)) dx = [f(x) p(x)]_{−∞}^{∞} = 0 5/52
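The identity E_p[T_p f] = 0 can be checked by Monte Carlo. A minimal sketch under assumptions not in the slides: p is the standard normal (so the boundary terms vanish, and the score is d/dx log p(x) = −x) and the test function is f(x) = sin(x).

```python
import numpy as np

# Monte Carlo check of the Stein identity E_p[T_p f] = 0, where
# (T_p f)(x) = (1/p(x)) d/dx (f(x) p(x)) = f'(x) + f(x) d/dx log p(x).
# Assumptions (illustrative, not from the slides): p = N(0, 1), f(x) = sin(x).
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

stein_values = np.cos(x) + np.sin(x) * (-x)  # (T_p f)(x) at each sample
mc_mean = stein_values.mean()                # close to 0, up to MC error
```

With 10^5 samples the average is zero to within Monte Carlo error; the identity fails when the score does not match the sampling distribution, which is exactly what the kernel Stein discrepancy exploits.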
  • 10. Kernel Stein Discrepancy Stein operator T_p f = ∂_x f + f ∂_x(log p) Kernel Stein Discrepancy (KSD) KSD(p, q; F) = sup_{‖g‖_F ≤ 1} E_q T_p g − E_p T_p g 6/52
  • 11.–13. Kernel Stein Discrepancy Since E_p T_p g = 0, the model term cancels: KSD(p, q; F) = sup_{‖g‖_F ≤ 1} E_q T_p g [plots: p(x), q(x), and the Stein witness g*(x)] 6/52
  • 14. Kernel Stein Discrepancy Closed-form expression for KSD: given Z, Z′ ∼ q i.i.d., (Chwialkowski, Strathmann, G., ICML 2016; Liu, Lee, Jordan, ICML 2016) KSD²(p, q; F) = E_q h_p(Z, Z′), where h_p(x, y) := ∂_x log p(x) ∂_y log p(y) k(x, y) + ∂_y log p(y) ∂_x k(x, y) + ∂_x log p(x) ∂_y k(x, y) + ∂_x ∂_y k(x, y) and k is the kernel of the RKHS F. This depends only on the kernel and on ∂_x log p(x): we need neither to normalize p nor to sample from it. 7/52
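The closed-form expression above can be sketched numerically with the V-statistic (1/n²) Σ_{i,j} h_p(z_i, z_j). Everything concrete here is an illustrative assumption rather than the tutorial's code: a one-dimensional standard normal model p (so ∂_x log p(x) = −x) and a Gaussian RBF kernel with unit bandwidth.

```python
import numpy as np

def ksd2_vstat(z, score, sigma=1.0):
    """V-statistic estimate of KSD^2 for 1-d samples z under model score.

    Uses h_p(x, y) = s(x) s(y) k(x, y) + s(y) dx_k + s(x) dy_k + dxdy_k
    for the RBF kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2)).
    """
    d = z[:, None] - z[None, :]                 # pairwise differences x - y
    k = np.exp(-d**2 / (2 * sigma**2))          # kernel matrix
    dx_k = -d / sigma**2 * k                    # d/dx k(x, y)
    dy_k = d / sigma**2 * k                     # d/dy k(x, y)
    dxdy_k = (1.0 / sigma**2 - d**2 / sigma**4) * k
    s = score(z)                                # model score at each sample
    h = (np.outer(s, s) * k + s[None, :] * dx_k
         + s[:, None] * dy_k + dxdy_k)
    return h.mean()

rng = np.random.default_rng(1)
score = lambda x: -x                            # d/dx log p for p = N(0, 1)
matched = ksd2_vstat(rng.standard_normal(500), score)        # q = p
shifted = ksd2_vstat(rng.standard_normal(500) + 2.0, score)  # q = N(2, 1)
```

When the samples match the model, the estimate is near zero (a small positive bias remains because the V-statistic keeps the diagonal terms); shifted samples give a value far from zero.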
  • 16. Statistical model criticism Chicago crime data. Model is a Gaussian mixture with two components. 8/52
  • 17. Statistical model criticism Chicago crime data. Model is a Gaussian mixture with two components, with the Stein witness function overlaid. 8/52
  • 18. Statistical model criticism Chicago crime data. Model is a Gaussian mixture with ten components. 8/52
  • 19. Statistical model criticism Chicago crime data. Model is a Gaussian mixture with ten components, with the Stein witness function overlaid. Code: https://github.com/karlnapf/kernel_goodness_of_fit 8/52
  • 20.–21. Kernel Stein Discrepancy Further applications: evaluation of approximate MCMC methods. (Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017) What kernel to use? The inverse multiquadric kernel, k(x, y) = (c + ‖x − y‖²₂)^β for β ∈ (−1, 0).
  • 24. Dependence testing Given: samples from a distribution P_XY. Goal: are X and Y independent? "Their noses guide them through life, and they're never happier than when following an interesting scent." "A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose." "A responsive, interactive pet, one that will blow in your ear and follow you everywhere." Text from dogtime.com and petfinder.com. 11/52
  • 25.–27. MMD as a dependence measure? Could we use MMD? MMD(P_XY, P_X × P_Y; H), with P := P_XY and Q := P_X × P_Y. We don't have samples from Q := P_X × P_Y, only pairs {(x_i, y_i)}_{i=1}^n ∼ i.i.d. P_XY. Solution: simulate Q with pairs (x_i, y_j) for j ≠ i. What kernel to use for the RKHS H? 12/52
  • 28.–29. MMD as a dependence measure Kernel k on images with feature space F; kernel l on captions with feature space G. Kernel on image–text pairs: are images and captions similar? 13/52
  • 30.–31. MMD as a dependence measure Given: samples from a distribution P_XY. Goal: are X and Y independent? MMD²(P̂_XY, P̂_X × P̂_Y; H) := (1/n²) trace(KL) (K, L column centered) 14/52
  • 32. MMD as a dependence measure Two questions: Why the product kernel? There are many ways to combine kernels: why not, e.g., a sum? Is there a more interpretable way of defining this dependence measure? 15/52
  • 33.–35. Finding covariance with smooth transformations Illustration: two variables with no correlation but strong dependence. [plots: raw samples, correlation 0.00; after smooth transformations of each variable, correlation 0.90] 16/52
  • 36. Define two spaces, one for each witness. Function in F: f(x) = Σ_{j=1}^∞ f_j φ_j(x), with feature map φ and kernel k(x, x′) = ⟨φ(x), φ(x′)⟩_F for the RKHS F on X. Function in G: g(y) = Σ_{j=1}^∞ g_j ψ_j(y), with feature map ψ and kernel l(y, y′) = ⟨ψ(y), ψ(y′)⟩_G for the RKHS G on Y. 17/52
  • 37. The constrained covariance The constrained covariance is COCO(P_XY) = sup_{‖f‖_F ≤ 1, ‖g‖_G ≤ 1} cov[f(x), g(y)] [plots: uncorrelated samples (correlation 0.00); after the witness transformations, correlation 0.90] 18/52
  • 38.–41. The constrained covariance COCO(P_XY) = sup_{‖f‖_F ≤ 1, ‖g‖_G ≤ 1} E_xy[(Σ_{j=1}^∞ f_j φ_j(x)) (Σ_{j=1}^∞ g_j ψ_j(y))] Fine print: the feature maps φ(x) and ψ(y) are assumed to have zero mean. Rewriting: E_xy[f(x) g(y)] = f⊤ C_{φ(x)ψ(y)} g, where f = [f_1 f_2 …]⊤, g = [g_1 g_2 …]⊤, and C_{φ(x)ψ(y)} := E_xy[φ(x) ψ(y)⊤] is the feature covariance. COCO is the maximum singular value of the feature covariance C_{φ(x)ψ(y)}. 18/52
  • 42.–45. Computing COCO in practice Given a sample {(x_i, y_i)}_{i=1}^n ∼ i.i.d. P_XY, what is the empirical COCO? It is the largest eigenvalue γ_max of the generalized eigenvalue problem
      [ 0         (1/n) KL ] [α]       [ K  0 ] [α]
      [ (1/n) LK  0        ] [β]  = γ  [ 0  L ] [β]
  with K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j). Fine print: kernels are computed with empirically centered features φ(x) − (1/n) Σ_{i=1}^n φ(x_i) and ψ(y) − (1/n) Σ_{i=1}^n ψ(y_i). (A.G., A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schoelkopf, and N. Logothetis, AISTATS 2005) 19/52
  • 46.–49. Computing COCO in practice The witness functions are built from the singular vectors: f(x) ∝ Σ_{i=1}^n α_i k(x_i, x) and g(y) ∝ Σ_{i=1}^n β_i l(y_i, y). 19/52
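The eigenvalue computation above can be sketched as follows; this is an illustrative implementation rather than the authors' code. It assumes RBF kernels with unit bandwidth and uses the equivalent formulation of COCO as the largest singular value of the empirical feature covariance, i.e. the square root of the top eigenvalue of (1/n²) K̃L̃, with empirically centered kernel matrices K̃, L̃.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    # Gaussian RBF kernel matrix between 1-d samples a and b
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def coco(x, y, sigma=1.0):
    # Empirical COCO: sqrt of the top eigenvalue of (1/n^2) K~ L~,
    # where K~, L~ are kernel matrices on empirically centered features.
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    Kc = H @ rbf(x, x, sigma) @ H
    Lc = H @ rbf(y, y, sigma) @ H
    eigs = np.linalg.eigvals(Kc @ Lc / n**2)    # real, nonnegative in theory
    return float(np.sqrt(np.max(eigs.real)))

rng = np.random.default_rng(2)
x = rng.standard_normal(300)
dep = coco(x, np.sin(2 * x) + 0.1 * rng.standard_normal(300))  # dependent
ind = coco(x, rng.standard_normal(300))                        # independent
```

Here dep comes out clearly larger than ind, but ind is not exactly zero: this is the finite-sample bias discussed on the later slides.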
  • 50. What is a large dependence with COCO? [plots: a smooth density and 500 samples from it; a rough density and 500 samples from it] The density takes the form P_XY ∝ 1 + sin(ωx) sin(ωy). Which of these is the more "dependent"? 20/52
  • 51. Finding covariance with smooth transformations Case of ω = 1: raw correlation 0.31; correlation after smooth transformations 0.50; COCO: 0.09 21/52
  • 52. Finding covariance with smooth transformations Case of ω = 2: raw correlation 0.02; correlation after smooth transformations 0.54; COCO: 0.07 22/52
  • 53. Finding covariance with smooth transformations Case of ω = 3: raw correlation 0.03; correlation after smooth transformations 0.44; COCO: 0.04 23/52
  • 54. Finding covariance with smooth transformations Case of ω = 4: raw correlation 0.05; correlation after smooth transformations 0.25; COCO: 0.02 24/52
  • 55. Finding covariance with smooth transformations Case of ω = ∞: raw correlation 0.01; correlation after smooth transformations 0.14; COCO: 0.02 25/52
  • 56. Finding covariance with smooth transformations Case of $\omega = 0$: uniform noise! (shows bias) Raw correlation 0.01; correlation after the smooth mappings: 0.14; COCO: 0.02. 26/52
  • 57. Dependence is largest at “low” frequencies As the dependence is encoded at higher frequencies, the smooth mappings $f, g$ achieve lower linear dependence. Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be found by $f, g$ (bias). This bias decreases with increasing sample size. 27/52
  • 58. Can we do better than COCO? A second example with zero correlation. First singular value of the feature covariance $C_{\varphi(x)\psi(y)}$: raw correlation 0.00; correlation after the first pair of witness functions: 0.80; $\mathrm{COCO}_1$: 0.11. 28/52
  • 60. Can we do better than COCO? A second example with zero correlation. Second singular value of the feature covariance $C_{\varphi(x)\psi(y)}$: raw correlation 0.00; correlation after the second pair of witness functions: 0.37; $\mathrm{COCO}_2$: 0.06. 28/52
  • 62. The Hilbert-Schmidt Independence Criterion Writing the $i$th singular value of the feature covariance $C_{\varphi(x)\psi(y)}$ as $\gamma_i := \mathrm{COCO}_i(P_{XY}; \mathcal{F}, \mathcal{G})$, define the Hilbert-Schmidt Independence Criterion (HSIC): $$\mathrm{HSIC}^2(P_{XY}; \mathcal{F}, \mathcal{G}) = \sum_{i=1}^{\infty} \gamma_i^2.$$ HSIC is MMD with a product kernel! $$\mathrm{HSIC}^2(P_{XY}; \mathcal{F}, \mathcal{G}) = \mathrm{MMD}^2(P_{XY}, P_X P_Y; \mathcal{H}_\kappa)$$ where $\kappa((x,y),(x',y')) = k(x,x')\, l(y,y')$. AG, O. Bousquet, A. Smola, and B. Schölkopf, ALT 2005; AG, K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola, NIPS 2007. 29/52
  • 63. Asymptotics of HSIC under independence Given a sample $\{(x_i, y_i)\}_{i=1}^n \overset{i.i.d.}{\sim} P_{XY}$, what is the empirical HSIC? Empirical HSIC (biased): $$\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\,\mathrm{trace}(\tilde{K}\tilde{L}),$$ where $K_{ij} = k(x_i, x_j)$ and $L_{ij} = l(y_i, y_j)$ ($\tilde{K}$ and $\tilde{L}$ computed with empirically centered features). Statistical testing: given $P_{XY} = P_X P_Y$, what is the threshold $c_\alpha$ such that $P(\widehat{\mathrm{HSIC}} > c_\alpha) \le \alpha$ for small $\alpha$? Asymptotics of $\widehat{\mathrm{HSIC}}$ when $P_{XY} = P_X P_Y$: $$n\,\widehat{\mathrm{HSIC}} \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l z_l^2, \qquad z_l \sim \mathcal{N}(0,1) \text{ i.i.d.},$$ where $\lambda_l \psi_l(z_j) = \int h_{ijqr}\, \psi_l(z_i)\, dF_{i,q,r}$ and $h_{ijqr} = \frac{1}{4!}\sum^{(i,j,q,r)}_{(t,u,v,w)} \left( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} \right)$. 30/52
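The biased estimator $\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\mathrm{trace}(\tilde K \tilde L)$ is a few lines of NumPy. A minimal sketch, assuming Gaussian kernels and using the centering matrix $H = I - \mathbf{1}\mathbf{1}^\top/n$ so that $\tilde K = HKH$ (the names `rbf` and `hsic_biased` are illustrative):

```python
import numpy as np

def rbf(X, sigma=1.0):
    # Gaussian kernel matrix on the rows of X
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_biased(X, Y, sigma=1.0):
    """Biased empirical HSIC = trace(K~ L~) / n^2, K~ = HKH, L~ = HLH."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ rbf(X, sigma) @ H
    Lc = H @ rbf(Y, sigma) @ H
    return np.trace(Kc @ Lc) / n**2
```

Since $\tilde K$ and $\tilde L$ are positive semi-definite, the statistic is always non-negative, and it is larger for dependent pairs than for independent ones at the same sample size.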
  • 67. A statistical test Given $P_{XY} = P_X P_Y$, what is the threshold $c_\alpha$ such that $P(\widehat{\mathrm{HSIC}} > c_\alpha) \le \alpha$ for small $\alpha$ (prob. of false positive)? Original time series: $(X_1, Y_1), (X_2, Y_2), \ldots, (X_{10}, Y_{10})$. Permutation: $(X_1, Y_7), (X_2, Y_3), (X_3, Y_9), (X_4, Y_2), \ldots$ Null distribution via permutation: compute HSIC for $\{x_i, y_{\sigma(i)}\}_{i=1}^n$ for a random permutation $\sigma$ of the indices $\{1, \ldots, n\}$; this gives HSIC for independent variables. Repeat for many different permutations to get an empirical CDF. The threshold $c_\alpha$ is the $1 - \alpha$ quantile of the empirical CDF. 31/52
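The permutation procedure above can be sketched directly. Note that centering commutes with permutation ($PHP^\top = H$), so it is enough to permute the rows and columns of the centered matrix $\tilde L$. As before, the kernel choice and function names are assumptions of this sketch:

```python
import numpy as np

def rbf(X, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_test(X, Y, alpha=0.05, n_perms=200, sigma=1.0, seed=0):
    """Permutation test for independence: returns (statistic, threshold, reject)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ rbf(X, sigma) @ H
    Lc = H @ rbf(Y, sigma) @ H
    stat = np.trace(Kc @ Lc) / n**2
    rng = np.random.default_rng(seed)
    null = np.empty(n_perms)
    for b in range(n_perms):
        p = rng.permutation(n)                        # breaks the X-Y pairing
        null[b] = np.trace(Kc @ Lc[np.ix_(p, p)]) / n**2
    c_alpha = np.quantile(null, 1 - alpha)            # (1 - alpha) quantile of null CDF
    return stat, c_alpha, stat > c_alpha
```

On strongly dependent data the statistic lands far above the permutation threshold, so the test rejects the null of independence.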
  • 70. Application: dependence detection across languages Testing task: detect dependence between English and French text. Example aligned pairs: “Honourable senators, I have a question for the Leader of the Government in the Senate” / “Honorables sénateurs, ma question s’adresse au leader du gouvernement au Sénat”; “No doubt there is great pressure on provincial and municipal governments” / “Les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions”; “In fact, we have increased federal investments for early childhood development” / “Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes”. Text from the aligned Hansards of the 36th parliament of Canada, https://www.isi.edu/natural-language/download/hansard/ 32/52
  • 71. Application: dependence detection across languages Testing task: detect dependence between English and French text. $k$-spectrum kernel with $k = 10$, sample size $n = 10$. $\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\,\mathrm{trace}(\tilde{K}\tilde{L})$ ($\tilde{K}$ and $\tilde{L}$ column centered). 33/52
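A $k$-spectrum kernel counts shared length-$k$ substrings: $k(s,t) = \sum_u \mathrm{count}_s(u)\,\mathrm{count}_t(u)$ over all strings $u$ of length $k$. A minimal sketch (not the implementation used for the slide's experiments; `k_spectrum` is an illustrative name):

```python
from collections import Counter

def k_spectrum(s, t, k=10):
    """k-spectrum kernel: inner product of substring-count feature vectors.
    Counts every length-k substring of s and of t, then takes the
    dot product of the two count vectors."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(c * ct[u] for u, c in cs.items())
```

For example, with `k=3`, `"abcabc"` contains the substring `"abc"` twice, so its kernel value against `"abc"` is 2, while strings sharing no length-3 substring score 0.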
  • 72. Application: dependence detection across languages Results (for $\alpha = 0.05$): $k$-spectrum kernel: average Type II error 0. Bag-of-words kernel: average Type II error 0.18. Settings: five-line extracts, averaged over 300 repetitions, for the “Agriculture” transcripts. Similar results for the Fisheries and Immigration transcripts. 34/52
  • 73. Testing higher order interactions 35/52
  • 76. Detecting higher order interaction How to detect V-structures with pairwise weak individual dependence? $X \perp\!\!\!\perp Y$, $Y \perp\!\!\!\perp Z$, $X \perp\!\!\!\perp Z$. [Figure: scatter plots of $X_1$ vs $Y_1$, $Y_1$ vs $Z_1$, $X_1$ vs $Z_1$, and $X_1 Y_1$ vs $Z_1$; only the last shows dependence.] $X, Y \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, $\ Z \mid X, Y \sim \mathrm{sign}(XY)\,\mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right)$. Fine print: faithfulness is violated here! 36/52
  • 77. V-structure discovery Assume $X \perp\!\!\!\perp Y$ has been established. A V-structure can then be detected by: a consistent CI test, $H_0: X \perp\!\!\!\perp Y \mid Z$ [Fukumizu et al. 2008, Zhang et al. 2011]; or a factorisation test, $H_0: (X,Y) \perp\!\!\!\perp Z \ \vee\ (X,Z) \perp\!\!\!\perp Y \ \vee\ (Y,Z) \perp\!\!\!\perp X$ (multiple standard two-variable tests). How well do these work? 37/52
  • 78. Detecting higher order interaction Generalise the earlier example to $p$ dimensions: $X \perp\!\!\!\perp Y$, $Y \perp\!\!\!\perp Z$, $X \perp\!\!\!\perp Z$, with $X_1, Y_1 \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, $\ Z_1 \mid X, Y \sim \mathrm{sign}(X_1 Y_1)\,\mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right)$, and $X_{2:p}, Y_{2:p}, Z_{2:p} \overset{i.i.d.}{\sim} \mathcal{N}(0, I_{p-1})$. Fine print: faithfulness is violated here! 38/52
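The construction above is easy to sample. A sketch under one stated assumption: the $\mathrm{Exp}(1/\sqrt{2})$ parameter is read here as the scale of the exponential (it could equally denote the rate); the function name is illustrative. By construction $Z_1$ is positive exactly when $X_1$ and $Y_1$ share a sign, so all pairwise correlations vanish while the triple interacts jointly:

```python
import numpy as np

def v_structure_sample(n, seed=0):
    """X, Y ~ N(0,1) i.i.d.; Z | X, Y = sign(X*Y) * Exp(scale = 1/sqrt(2)).
    Each pair (X,Z), (Y,Z), (X,Y) is uncorrelated, but sign(Z) = sign(X*Y),
    so the three variables have a purely third-order interaction."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n)
    Y = rng.normal(size=n)
    Z = np.sign(X * Y) * rng.exponential(scale=1 / np.sqrt(2), size=n)
    return X, Y, Z
```

A quick sanity check: the correlation of $Z$ with $X$ or $Y$ alone is near zero, while the correlation of $Z$ with the product $XY$ is clearly positive.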
  • 79. V-structure discovery CI test for $X \perp\!\!\!\perp Y \mid Z$ from Zhang et al. (2011), and a factorisation test, $n = 500$. 39/52
  • 82. Lancaster interaction measure The Lancaster interaction measure of $(X_1, \ldots, X_D) \sim P$ is a signed measure $\Delta_L P$ that vanishes whenever $P$ can be factorised non-trivially. $D = 2$: $\Delta_L P = P_{XY} - P_X P_Y$. $D = 3$: $\Delta_L P = P_{XYZ} - P_X P_{YZ} - P_Y P_{XZ} - P_Z P_{XY} + 2 P_X P_Y P_Z$. [Figure: the five terms of $\Delta_L P$ drawn as graphs over $X$, $Y$, $Z$.] 40/52
  • 83. Lancaster interaction measure Case of $X \perp\!\!\!\perp (Y, Z)$: substituting $P_{XYZ} = P_X P_{YZ}$ (and hence $P_{XY} = P_X P_Y$, $P_{XZ} = P_X P_Z$) into the $D = 3$ expression gives $\Delta_L P = 0$. 40/52
  • 84. Lancaster interaction measure $(X,Y) \perp\!\!\!\perp Z \ \vee\ (X,Z) \perp\!\!\!\perp Y \ \vee\ (Y,Z) \perp\!\!\!\perp X \ \Rightarrow\ \Delta_L P = 0$. ...so what might be missed? 40/52
  • 85. Lancaster interaction measure The converse fails: $\Delta_L P = 0 \ \nRightarrow\ (X,Y) \perp\!\!\!\perp Z \ \vee\ (X,Z) \perp\!\!\!\perp Y \ \vee\ (Y,Z) \perp\!\!\!\perp X$. Example: $P(0,0,0) = 0.2$, $P(0,0,1) = 0.1$, $P(1,0,0) = 0.1$, $P(1,0,1) = 0.1$, $P(0,1,0) = 0.1$, $P(0,1,1) = 0.1$, $P(1,1,0) = 0.1$, $P(1,1,1) = 0.2$. 40/52
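The counterexample can be verified numerically: the pmf above has $\Delta_L P = 0$ at every point, yet none of the three two-against-one factorisations holds. A short check with NumPy broadcasting (array layout `P[x, y, z]` is this sketch's convention):

```python
import numpy as np

# Joint pmf from the slide, indexed P[x, y, z] for binary X, Y, Z
P = np.array([[[0.2, 0.1],    # P(0,0,0), P(0,0,1)
               [0.1, 0.1]],   # P(0,1,0), P(0,1,1)
              [[0.1, 0.1],    # P(1,0,0), P(1,0,1)
               [0.1, 0.2]]])  # P(1,1,0), P(1,1,1)

Px, Py, Pz = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))   # single marginals
Pxy, Pxz, Pyz = P.sum(2), P.sum(1), P.sum(0)               # pair marginals

# Delta_L P = P_XYZ - P_X P_YZ - P_Y P_XZ - P_Z P_XY + 2 P_X P_Y P_Z
delta_L = (P
           - Px[:, None, None] * Pyz[None, :, :]
           - Py[None, :, None] * Pxz[:, None, :]
           - Pz[None, None, :] * Pxy[:, :, None]
           + 2 * Px[:, None, None] * Py[None, :, None] * Pz[None, None, :])
```

Here `delta_L` is identically zero, while e.g. $P_{XY}(0,0)\,P_Z(0) = 0.3 \cdot 0.5 = 0.15 \ne 0.2 = P(0,0,0)$, so $(X,Y) \not\perp\!\!\!\perp Z$.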
  • 86. A kernel test statistic using the Lancaster measure Construct a test by estimating $\|\mu_\kappa(\Delta_L P)\|^2_{\mathcal{H}_\kappa}$, where $\kappa = k \otimes l \otimes m$: $$\|\mu(P_{XYZ} - P_{XY} P_Z - \cdots)\|^2_{\mathcal{H}_\kappa} = \langle \mu P_{XYZ}, \mu P_{XYZ} \rangle_{\mathcal{H}_\kappa} - 2\,\langle \mu P_{XYZ}, \mu(P_{XY} P_Z) \rangle_{\mathcal{H}_\kappa} + \cdots$$ 41/52
  • 87. A kernel test statistic using the Lancaster measure Table: $V$-statistic estimators of $\langle \cdot, \cdot \rangle_{\mathcal{H}_\kappa}$ (without the terms $P_X P_Y P_Z$). $H$ is the centering matrix $I - n^{-1}\mathbf{1}\mathbf{1}^\top$. Lancaster interaction statistic: $$\widehat{\|\mu(\Delta_L P)\|}^2_{\mathcal{H}_\kappa} = \frac{1}{n^2}\left((HKH) \circ (HLH) \circ (HMH)\right)_{++},$$ i.e. the elementwise product of the three centered Gram matrices, summed over all entries: the empirical joint central moment in the feature space. D. Sejdinovic, AG, W. Bergsma, NIPS 2013. 42/52
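The Lancaster statistic is one line once the centered Gram matrices are in hand: elementwise-multiply the three matrices, sum all entries, divide by $n^2$. A sketch assuming Gaussian kernels on all three variables (illustrative names):

```python
import numpy as np

def rbf(x, sigma=1.0):
    X = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def lancaster_stat(X, Y, Z, sigma=1.0):
    """Empirical ||mu(Delta_L P)||^2: (1/n^2) * grand sum of the
    elementwise product of the three centered Gram matrices."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ rbf(X, sigma) @ H
    L = H @ rbf(Y, sigma) @ H
    M = H @ rbf(Z, sigma) @ H
    return np.sum(K * L * M) / n**2
```

On the sign($XY$) construction from the earlier slides, the statistic is noticeably larger than on a fully independent triple of the same size, which is exactly the regime where pairwise tests fail.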
  • 89. V-structure discovery Lancaster test, CI test for X cc Y jZ from Zhang et al. (2011), and a factorisation test, n a 500 43/52
  • 90. Interaction for $D \ge 4$ An interaction measure valid for all $D$ (Streitberg, 1990): $$\Delta_S P = \sum_{\pi} (-1)^{|\pi|-1}\,(|\pi|-1)!\ J_\pi P.$$ For a partition $\pi$, $J_\pi$ associates to the joint the corresponding factorisation, e.g. $J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}$. 44/52
  • 92. Interaction for $D \ge 4$ The sum in $\Delta_S P$ runs over all partitions of $\{1, \ldots, D\}$, whose number grows as the Bell numbers: [Figure: Bell number growth, reaching beyond $10^{18}$ by $D = 25$.] 44/52
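The Bell-number growth quoted above is cheap to reproduce with the Bell triangle recurrence (each row starts with the last entry of the previous row, and each subsequent entry adds the entry above); the function name is illustrative:

```python
def bell_numbers(D):
    """Bell numbers B_1..B_D via the Bell triangle.
    B_d counts the partitions of {1, ..., d}, i.e. the number of terms
    in Streitberg's interaction measure for d variables."""
    row, bells = [1], [1]                 # B_1 = 1
    for _ in range(D - 1):
        nxt = [row[-1]]                   # new row starts with last entry
        for v in row:
            nxt.append(nxt[-1] + v)       # each entry adds the one above
        row = nxt
        bells.append(row[-1])             # B_{d+1} is the row's last entry
    return bells
```

For example, `bell_numbers(6)` gives `[1, 2, 5, 15, 52, 203]`, and $B_{25} \approx 4.6 \times 10^{18}$, matching the explosion shown on the slide.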
  • 93. Part 4: Advanced topics 45/52
  • 94. Advanced topics: testing on time series; testing for conditional dependence; regression and the conditional mean embedding. 46/52
  • 99. Measures of divergence Sriperumbudur, Fukumizu, Gretton, Schölkopf, Lanckriet (2012) 51/52
  • 100. Co-authors From Gatsby: Kacper Chwialkowski, Wittawat Jitkrittum, Bharath Sriperumbudur, Heiko Strathmann, Dougal Sutherland, Zoltan Szabo, Wenkai Xu. External collaborators: Kenji Fukumizu, Bernhard Schölkopf, Alex Smola. Questions? 52/52