MDL PRINCIPLE

The MDL principle for arbitrary data:
either discrete or continuous or none of them
Joe Suzuki
Osaka University
WITMSE 2013
Sanjo-Kaikan, University of Tokyo, Japan
August 26, 2013

Road Map
1 Problem
2 The Ryabko measure
3 The Radon-Nikodym theorem
4 Generalization
5 Universal Histogram Sequence
6 Conclusion

Road Map
The slides of this talk are available online (search keywords: Joe Suzuki slideshare):
http://www.slideshare.net/prof-joe/

Problem
Given $\{(x_i, y_i)\}_{i=1}^{n}$, identify whether $X \perp\!\!\!\perp Y$ or not
$A$, $B$: finite sets
$x^n = (x_1, \cdots, x_n) \in A^n$, $y^n = (y_1, \cdots, y_n) \in B^n$
$P^n(x^n|\theta)$, $P^n(y^n|\theta)$, $P^n(x^n, y^n|\theta)$: expressed by parameter $\theta$
$p$: the prior probability of $X \perp\!\!\!\perp Y$
Bayesian solution
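$$X \perp\!\!\!\perp Y \iff p\,Q^n(x^n)\,Q^n(y^n) \ge (1-p)\,Q^n(x^n, y^n)$$
$$Q^n(x^n) := \int P^n(x^n|\theta)\,w(\theta)\,d\theta, \qquad Q^n(y^n) := \int P^n(y^n|\theta)\,w(\theta)\,d\theta$$
$$Q^n(x^n, y^n) := \int P^n(x^n, y^n|\theta)\,w(\theta)\,d\theta$$
using a weight $w$ over $\theta$.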
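The decision rule on this slide can be sketched for finite alphabets as follows. This is a minimal sketch, assuming the weight $w$ is the Dirichlet(1/2, ..., 1/2) prior (the Krichevsky-Trofimov choice), for which the mixture integral has a closed form in Gamma functions. The names `log_kt` and `independent` are hypothetical, and the comparison is done in log space to avoid underflow.

```python
from collections import Counter
from math import lgamma, log


def log_kt(symbols, alphabet_size):
    """log Q^n(x^n) for the Dirichlet(1/2, ..., 1/2) mixture of i.i.d. sources.

    Closed form: Gamma(m/2) / Gamma(n + m/2) * prod_x Gamma(c[x] + 1/2) / Gamma(1/2),
    where m is the alphabet size and c[x] counts the occurrences of x.
    """
    n = len(symbols)
    m = alphabet_size
    out = lgamma(m / 2) - lgamma(n + m / 2)
    for c in Counter(symbols).values():
        out += lgamma(c + 0.5) - lgamma(0.5)
    return out


def independent(xs, ys, size_a, size_b, p=0.5):
    """Return True iff p Q(x^n) Q(y^n) >= (1 - p) Q(x^n, y^n)."""
    lhs = log(p) + log_kt(xs, size_a) + log_kt(ys, size_b)
    # The pair (x_i, y_i) is treated as one symbol of the product alphabet A x B.
    rhs = log(1 - p) + log_kt(list(zip(xs, ys)), size_a * size_b)
    return lhs >= rhs


# Usage: a balanced 2 x 2 design versus a sequence paired with itself.
xs = [0, 0, 1, 1] * 25
ys = [0, 1, 0, 1] * 25
print(independent(xs, ys, 2, 2), independent(xs, xs, 2, 2))
```

Unobserved symbols contribute a factor of 1 to the product, so iterating over the observed counts only is enough.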

Problem
Q should be an alternative to P as n grows
$A$: the finite set in which $X$ takes values.
$Q$ is a Bayesian measure
Kraft's inequality:
$$\sum_{x^n \in A^n} Q^n(x^n) \le 1 \qquad (1)$$
For example, $Q^n(x^n) = |A|^{-n}$, $x^n \in A^n$, satisfies (1) but does not converge to $P^n$ in any sense.

Problem
Universal Bayesian Measures
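$$Q^n(x^n) := \int P^n(x^n|\theta)\,w(\theta)\,d\theta$$
$$w(\theta) \propto \prod_{x \in A} \theta[x]^{-a[x]}$$
with $\{a[x] = \frac{1}{2}\}_{x \in A}$ (Krichevsky-Trofimov), where $\theta[x]$ is the probability of symbol $x$.
$$-\frac{1}{n}\log Q^n(x^n) \to H(P)$$
for any $P^n(x^n|\theta) = \prod_{x \in A} \theta[x]^{c[x]}$, where $c[x]$ is the number of occurrences of $x$ in $x^n \in A^n$.
Shannon-McMillan-Breiman:
$$-\frac{1}{n}\log P^n(x^n|\theta) \to H(P)$$
for any stationary ergodic $P$, so that for $P^n(x^n) := P^n(x^n|\theta)$,
$$\frac{1}{n}\log\frac{P^n(x^n)}{Q^n(x^n)} \to 0 . \qquad (2)$$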
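The convergence of the per-symbol code length to the entropy can be checked numerically. The sketch below assumes the standard sequential form of the Krichevsky-Trofimov mixture, in which symbol $x$ is predicted at time $t$ with probability $(c[x] + 1/2)/(t + |A|/2)$. The function name `kt_code_length` and the test sequence (a deterministic binary sequence with 20% ones, standing in for an i.i.d. sample) are illustrative assumptions.

```python
import math


def kt_code_length(xs, alphabet):
    """-log2 Q^n(x^n) in bits, accumulated one symbol at a time."""
    counts = {a: 0 for a in alphabet}
    bits = 0.0
    for t, x in enumerate(xs):
        # Predictive probability of the Dirichlet(1/2) mixture after t symbols.
        bits -= math.log2((counts[x] + 0.5) / (t + len(alphabet) / 2))
        counts[x] += 1
    return bits


# Usage: compare the KT rate with the entropy and with the uniform measure.
xs = [1, 0, 0, 0, 0] * 2000
kt_rate = kt_code_length(xs, [0, 1]) / len(xs)
entropy = -(0.2 * math.log2(0.2) + 0.8 * math.log2(0.8))
uniform_rate = math.log2(2)  # Q^n(x^n) = |A|^{-n} costs log2|A| bits per symbol
print(kt_rate, entropy, uniform_rate)
```

The uniform measure satisfies Kraft's inequality as well, but its rate stays at $\log_2 |A|$ regardless of the source, which is why it is not universal.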

Problem
When X has a density function f
There exists a $g$ s.t.
$$\int_{x^n \in \mathbb{R}^n} g^n(x^n)\,dx^n \le 1 \qquad (3)$$
$$\frac{1}{n}\log\frac{f^n(x^n)}{g^n(x^n)} \to 0 \qquad (4)$$
for any $f$ satisfying a condition mentioned later (Ryabko 2009).

Problem
The problem in this paper
Universal Bayesian measure in the general settings
What are (1)(2) and (3)(4) for general random variables?
1 without assuming either discrete or continuous
2 removing the constraint Ryabko poses:

The Ryabko measure
Ryabko measure: X has a density function f
$A$: the set in which $X$ takes values.
$\{A_j\}_{j=0}^{\infty}$:
$$\begin{cases} A_0 := \{A\} \\ A_{j+1} \text{ is a refinement of } A_j \end{cases}$$
For example, for $A = [0, 1)$, $A_0 = \{[0, 1)\}$
$$A_1 = \{[0, 1/2), [1/2, 1)\}$$
$$A_2 = \{[0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1)\}$$
$$\cdots$$
$$A_j = \{[0, 2^{-j}), [2^{-j}, 2 \cdot 2^{-j}), \cdots, [(2^{j} - 1)2^{-j}, 1)\}$$
$$\cdots$$
$s_j : A \to A_j$: $x \in a \in A_j \Longrightarrow s_j(x) = a$
$\lambda$: the Lebesgue measure
$$f_j(x) := \frac{P_j(s_j(x))}{\lambda(s_j(x))} \quad \text{for } x \in A$$
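A small sketch of the histogram density $f_j$ on the dyadic partitions of $[0, 1)$ may help here. It assumes that $P_j(a)$ is the probability $P(X \in a)$ of the cell $a$, which the slide does not spell out. The names `cell_index` and `histogram_density`, and the example density $f(x) = 2x$, are hypothetical.

```python
def cell_index(x, j):
    """s_j(x): index i of the dyadic cell [i 2^-j, (i+1) 2^-j) containing x in [0, 1)."""
    return min(int(x * 2**j), 2**j - 1)


def histogram_density(x, j, cell_prob):
    """f_j(x) = P_j(s_j(x)) / lambda(s_j(x)), with cell_prob(lo, hi) = P(lo <= X < hi)."""
    i = cell_index(x, j)
    width = 2.0 ** -j  # Lebesgue measure of every cell of A_j
    return cell_prob(i * width, (i + 1) * width) / width


# Usage: X with density f(x) = 2x on [0, 1), so P(lo <= X < hi) = hi^2 - lo^2.
def cell_prob(lo, hi):
    return hi**2 - lo**2


for j in (0, 2, 5, 10):
    print(j, histogram_density(0.3, j, cell_prob))
```

As $j$ grows, the printed values approach $f(0.3) = 0.6$, since each $f_j$ is the average of $f$ over a shrinking cell.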

The Ryabko measure
Given $x^n = (x_1, \cdots, x_n) \in A^n$ s.t. $(s_j(x_1), \cdots, s_j(x_n)) = (a_1, \cdots, a_n) \in A_j^n$,
$$f_j^n(x^n) := f_j(x_1) \cdots f_j(x_n) = \frac{P_j(a_1) \cdots P_j(a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$$
$$g_j^n(x^n) := \frac{Q_j^n(a_1, \cdots, a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$$
$Q_j$: a universal Bayesian measure w.r.t. finite set $A_j$.

$$f^n(x^n) := f(x_1) \cdots f(x_n)$$
$$g^n(x^n) := \sum_{j=0}^{\infty} \omega_j\, g_j^n(x^n) \quad \text{for } \{\omega_j\}_{j=0}^{\infty} \text{ s.t. } \sum_j \omega_j = 1,\ \omega_j > 0$$
$$\frac{1}{n}\log\frac{f^n(x^n)}{g^n(x^n)} \to 0$$
for any $f$ s.t. differential entropy $h(f_j) \to h(f)$ as $j \to \infty$ (Ryabko, 2009)
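The mixture $g^n$ can be sketched on $A = [0, 1)$ with the dyadic partitions. Several choices here are assumptions rather than part of the talk: $Q_j$ is taken to be the Krichevsky-Trofimov mixture on the $2^j$ cells, the infinite sum is truncated at `max_level`, and the weights are $\omega_j \propto 2^{-(j+1)}$ renormalised over the retained levels. All function names are hypothetical.

```python
import math
from collections import Counter


def log_kt(symbols, m):
    """log of the Dirichlet(1/2) mixture probability over an alphabet of size m."""
    n = len(symbols)
    out = math.lgamma(m / 2) - math.lgamma(n + m / 2)
    for c in Counter(symbols).values():
        out += math.lgamma(c + 0.5) - math.lgamma(0.5)
    return out


def log_g_level(xs, j):
    """log g_j^n(x^n) = log Q_j^n(a_1..a_n) - sum_i log lambda(a_i), lambda(a_i) = 2^-j."""
    cells = [min(int(x * 2**j), 2**j - 1) for x in xs]
    return log_kt(cells, 2**j) + len(xs) * j * math.log(2)


def log_g(xs, max_level=8):
    """log of the truncated mixture sum_j w_j g_j^n(x^n), computed by log-sum-exp."""
    weights = [2.0 ** -(j + 1) for j in range(max_level + 1)]
    total = sum(weights)
    terms = [math.log(w / total) + log_g_level(xs, j) for j, w in enumerate(weights)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Usage: quantiles of the density f(x) = 2x stand in for an i.i.d. sample.
n = 2000
xs = [math.sqrt((i + 0.5) / n) for i in range(n)]
log_f = sum(math.log(2 * x) for x in xs)
print((log_f - log_g(xs)) / n)  # per-sample log-ratio (1/n) log f^n / g^n
```

Working with logarithms keeps the computation stable, because $g_j^n$ can be astronomically large or small for moderate $n$.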

The Radon-Nikodym theorem
In general, exactly when does a density function exist?
$\mathcal{B}$: the collection of all Borel sets of $\mathbb{R}$
$\mu(D) := P(X \in D)$: the probability of $(X \in D)$ for $D \in \mathcal{B}$
$F_X$: the distribution function of $X$
$\mu$ is absolutely continuous w.r.t. $\lambda$ ($\mu \ll \lambda$)
The following two are equivalent:
1 $f : \mathbb{R} \to \mathbb{R}$ exists s.t. $P(X \le x) = F_X(x) = \int_{t \le x} f(t)\,dt$
2 for any $D \in \mathcal{B}$, $\lambda(D) := \int_D dx = 0 \Longrightarrow \mu(D) = 0$.
$$f(x) = \frac{dF_X(x)}{dx}$$

Even discrete variables have density functions!
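$B$: a countable subset of $\mathbb{R}$
$\mu(D) := P(X \in D)$: the probability of $(X \in D)$ for $D \subseteq B$
$r : B \to \mathbb{R}$
$\mu$ is absolutely continuous w.r.t. $\eta$ ($\mu \ll \eta$)
1 $f : B \to \mathbb{R}$ exists s.t. $P(X \in D) = \sum_{x \in D} f(x)\,r(x)$, $D \subseteq B$
2 for any $D \subseteq B$, $\eta(D) := \sum_{x \in D} r(x) = 0 \Longrightarrow \mu(D) = 0$.
$$f(x) = \frac{P(X = x)}{r(x)}$$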

Radon-Nikodym
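$\mu$, $\eta$: $\sigma$-finite measures over $\sigma$-field $\mathcal{F}$
$\mu$ is absolutely continuous w.r.t. $\eta$ ($\mu \ll \eta$)
1 $\mathcal{F}$-measurable $f$ exists s.t. for any $A \in \mathcal{F}$, $\mu(A) = \int_A f(t)\,d\eta(t)$
2 for any $A \in \mathcal{F}$, $\eta(A) = 0 \Longrightarrow \mu(A) = 0$
$$\int_A f(t)\,d\eta(t) := \sup_{\{A_i\}} \sum_i \Big[\inf_{x \in A_i} f(x)\Big]\,\eta(A_i)$$
$\frac{d\mu}{d\eta} := f$ is the density function w.r.t. $\eta$ when $\mu$ is a probability measure.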

Generalization
When Y has a density function w.r.t. η s.t. µ ≪ η
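$B$: the set in which $Y$ takes values.
$\{B_k\}_{k=0}^{\infty}$:
$$\begin{cases} B_0 := \{B\} \\ B_{k+1} \text{ is a refinement of } B_k \end{cases}$$
For example, for $B = \mathbb{N} := \{1, 2, \cdots\}$, $B_0 = \{B\}$
$$B_1 := \{\{1\}, \{2, 3, \cdots\}\}$$
$$B_2 := \{\{1\}, \{2\}, \{3, 4, \cdots\}\}$$
$$\cdots$$
$$B_k := \{\{1\}, \{2\}, \cdots, \{k\}, \{k+1, k+2, \cdots\}\}$$
$$\cdots$$
$t_k : B \to B_k$: $y \in b \in B_k \Longrightarrow t_k(y) = b$
$\eta$: $\mu \ll \eta$
$$f_k(y) := \frac{P_k(t_k(y))}{\eta(t_k(y))} \quad \text{for } y \in B$$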

Generalization
Given $y^n = (y_1, \cdots, y_n) \in B^n$ s.t. $(t_k(y_1), \cdots, t_k(y_n)) = (b_1, \cdots, b_n) \in B_k^n$,
$$f_k^n(y^n) := f_k(y_1) \cdots f_k(y_n) = \frac{P_k(b_1) \cdots P_k(b_n)}{\eta(b_1) \cdots \eta(b_n)}$$
$$g_k^n(y^n) := \frac{Q_k^n(b_1, \cdots, b_n)}{\eta(b_1) \cdots \eta(b_n)}$$
$Q_k$: a universal Bayesian measure w.r.t. finite set $B_k$
Similarly,
$$\frac{1}{n}\log\frac{f^n(y^n)}{g^n(y^n)} \to 0$$
for any $f$ s.t. $h(f_k) \to h(f)$ as $k \to \infty$
$$h(f) := \int -f(y)\log f(y)\,d\eta(y)$$

Generalization
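$$\mu^n(D^n) := \int_{D^n} f^n(y^n)\,d\eta^n(y^n), \quad D^n \in \mathcal{B}^n$$
$$\nu^n(D^n) := \int_{D^n} g^n(y^n)\,d\eta^n(y^n), \quad D^n \in \mathcal{B}^n$$
$$\frac{f^n(y^n)}{g^n(y^n)} = \frac{d\mu^n}{d\eta^n}(y^n) \Big/ \frac{d\nu^n}{d\eta^n}(y^n) = \frac{d\mu^n}{d\nu^n}(y^n)$$
$$D(\mu\|\nu) := \int d\mu \log\frac{d\mu}{d\nu}$$
$$h(f) := \int -f(y)\log f(y)\,d\eta(y) = -\int \frac{d\mu}{d\eta}(y)\log\frac{d\mu}{d\eta}(y) \cdot d\eta(y) = -D(\mu\|\eta)$$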
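The identity $h(f) = -D(\mu\|\eta)$ can be verified directly on a finite set, where $\eta(\{x\}) = r(x)$ and $f(x) = P(X = x)/r(x)$. The sketch below is only an illustration: the probabilities `p`, the weights `r`, and the function names are made up for the example.

```python
import math


def entropy_wrt(p, r):
    """h(f) = -sum_x f(x) log f(x) r(x) with f = p / r (density w.r.t. eta)."""
    total = 0.0
    for x, px in p.items():
        f = px / r[x]
        if f > 0:
            total -= f * math.log(f) * r[x]
    return total


def kl_to_measure(p, r):
    """D(mu || eta) = sum_x p(x) log(p(x) / r(x)); eta need not be a probability."""
    return sum(px * math.log(px / r[x]) for x, px in p.items() if px > 0)


# Usage: an arbitrary reference measure, then the counting measure r = 1.
p = {1: 0.5, 2: 0.25, 3: 0.25}
r = {1: 2.0, 2: 1.0, 3: 0.5}
print(entropy_wrt(p, r), -kl_to_measure(p, r))
print(entropy_wrt(p, {x: 1.0 for x in p}))  # Shannon entropy in nats
```

With the counting measure as $\eta$, the quantity $h(f)$ reduces to the Shannon entropy, and with the Lebesgue measure it reduces to the differential entropy.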

Generalization
Result 1
Proposition 1 (Suzuki, 2011)
If $\mu \ll \eta$, then there exists $\nu \ll \eta$ s.t. $\nu^n(\mathbb{R}^n) \le 1$ and
$$\frac{1}{n}\log\frac{d\mu^n}{d\nu^n}(y^n) \to 0$$
for any $\mu$ s.t. $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$.

Generalization
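The solution of the exercise in the Introduction
$\{A_j \times B_k\}$
$$g_{j,k}^n(x^n, y^n) := \frac{Q_{j,k}^n(a_1, b_1, \cdots, a_n, b_n)}{\lambda(a_1) \cdots \lambda(a_n)\,\eta(b_1) \cdots \eta(b_n)}$$
$$g^n(x^n, y^n) := \sum_{j,k} \omega_{j,k}\, g_{j,k}^n(x^n, y^n) \quad \text{for } \{\omega_{j,k}\} \text{ s.t. } \sum_{j,k} \omega_{j,k} = 1,\ \omega_{j,k} > 0$$
$$\frac{1}{n}\log\frac{f^n(x^n, y^n)}{g^n(x^n, y^n)} \to 0$$
Solution
We estimate
$$\frac{f^n(x^n, y^n)}{f^n(x^n)\,f^n(y^n)} \quad \text{by} \quad \frac{g^n(x^n, y^n)}{g^n(x^n)\,g^n(y^n)}$$
extending
$$\frac{Q^n(x^n, y^n)}{Q^n(x^n)\,Q^n(y^n)}.$$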
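The independence decision for non-discrete data can be sketched by replacing $Q^n$ with $g^n$ in the Bayesian rule from the Introduction. This is a rough sketch under several assumptions that are not in the slides: both variables live on $[0, 1)$ with $\eta = \lambda$, the partitions are dyadic, $Q_{j,k}$ is the Krichevsky-Trofimov mixture on the product cells, the sums are truncated at `levels`, and the weights factor as $\omega_{j,k} = \omega_j \omega_k$ with $\omega_j \propto 2^{-j}$. All names are hypothetical.

```python
import math
from collections import Counter


def log_kt(symbols, m):
    """log of the Dirichlet(1/2) mixture probability over an alphabet of size m."""
    n = len(symbols)
    out = math.lgamma(m / 2) - math.lgamma(n + m / 2)
    for c in Counter(symbols).values():
        out += math.lgamma(c + 0.5) - math.lgamma(0.5)
    return out


def cells(zs, j):
    """Dyadic cell index of every point at level j."""
    return [min(int(z * 2**j), 2**j - 1) for z in zs]


def logsumexp(terms):
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_w(j, levels):
    """log of the weight of level j, proportional to 2^-j over levels 0..levels."""
    return -j * math.log(2) - math.log(sum(2.0 ** -i for i in range(levels + 1)))


def log_g_marginal(zs, levels):
    n = len(zs)
    return logsumexp([log_w(j, levels) + log_kt(cells(zs, j), 2**j) + n * j * math.log(2)
                      for j in range(levels + 1)])


def log_g_joint(xs, ys, levels):
    n = len(xs)
    terms = []
    for j in range(levels + 1):
        for k in range(levels + 1):
            pairs = list(zip(cells(xs, j), cells(ys, k)))
            # Dividing by lambda(a_i) eta(b_i) = 2^-(j+k) adds n (j + k) log 2.
            terms.append(log_w(j, levels) + log_w(k, levels)
                         + log_kt(pairs, 2 ** (j + k)) + n * (j + k) * math.log(2))
    return logsumexp(terms)


def independent(xs, ys, p=0.5, levels=4):
    """True iff p g(x^n) g(y^n) >= (1 - p) g(x^n, y^n)."""
    lhs = math.log(p) + log_g_marginal(xs, levels) + log_g_marginal(ys, levels)
    rhs = math.log(1 - p) + log_g_joint(xs, ys, levels)
    return lhs >= rhs
```

For example, a 20 x 20 product grid of points is judged independent, while a sample paired with itself is not:

```python
grid_x = [(i % 20 + 0.5) / 20 for i in range(400)]
grid_y = [(i // 20 + 0.5) / 20 for i in range(400)]
diag = [(i + 0.5) / 400 for i in range(400)]
print(independent(grid_x, grid_y), independent(diag, diag))
```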

Generalization
Further generalization
Proposition 1 assumes
a specific histogram sequence $\{B_k\}$; and
that $\mu$ satisfies $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$

$\{B_k\}$ should be universal
Construct $\{B_k\}$ s.t. $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$ for any $\mu$

Universal Histogram Sequence
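Universal histogram sequence $\{B_k\}$
$\mu, \sigma \in \mathbb{R}$, $\sigma > 0$.
$\{C_k\}_{k=0}^{\infty}$:
$$C_0 = \{(-\infty, \infty)\}$$
$$C_1 = \{(-\infty, \mu], (\mu, \infty)\}$$
$$C_2 = \{(-\infty, \mu-\sigma], (\mu-\sigma, \mu], (\mu, \mu+\sigma], (\mu+\sigma, \infty)\}$$
$$\cdots$$
$C_k \to C_{k+1}$:
$$\begin{cases} (-\infty, \mu-(k-1)\sigma] \to (-\infty, \mu-k\sigma],\ (\mu-k\sigma, \mu-(k-1)\sigma] \\ (a, b] \to (a, \frac{a+b}{2}],\ (\frac{a+b}{2}, b] \\ (\mu+(k-1)\sigma, \infty) \to (\mu+(k-1)\sigma, \mu+k\sigma],\ (\mu+k\sigma, \infty) \end{cases}$$
$B$: the set in which $Y$ takes values
$$B_k := \{B \cap c \mid c \in C_k\} \setminus \{\emptyset\}$$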
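The refinement rule can be written out as a short program. In this sketch each cell $(a, b]$ is stored as a pair `(a, b)`, the first step $C_0 \to C_1$ is handled as a split at $\mu$ (as the listed $C_1$ suggests), and the helper `restrict_to_naturals` illustrates $B_k$ for the particular case $B = \mathbb{N}$. All function names are hypothetical.

```python
import math

INF = math.inf


def refine(cells, k, mu, sigma):
    """Return C_{k+1} from C_k; every cell is split into exactly two cells."""
    if k == 0:  # C_0 = {(-inf, inf)} is split at mu
        return [(-INF, mu), (mu, INF)]
    out = []
    for a, b in cells:
        if a == -INF:    # left tail: peel off a cell of width sigma
            cut = mu - k * sigma
        elif b == INF:   # right tail: peel off a cell of width sigma
            cut = mu + k * sigma
        else:            # bounded cell: halve it
            cut = (a + b) / 2
        out += [(a, cut), (cut, b)]
    return out


def histogram_sequence(k_max, mu, sigma):
    """[C_0, C_1, ..., C_{k_max}]."""
    levels = [[(-INF, INF)]]
    for k in range(k_max):
        levels.append(refine(levels[-1], k, mu, sigma))
    return levels


def restrict_to_naturals(cells):
    """B_k for B = N: non-empty sets {y in N : a < y <= b}, stored as (lo, hi)."""
    out = []
    for a, b in cells:
        lo = 1 if a < 1 else math.floor(a) + 1
        hi = b if b == INF else math.floor(b)
        if lo <= hi:
            out.append((lo, hi))
    return out


# Usage: with mu = 1 and sigma = 1 the cells of N are {1}, {2}, ..., {k}, {k+1, ...}.
levels = histogram_sequence(3, 1, 1)
print(restrict_to_naturals(levels[3]))
```

Since every cell is split in two at each step, $C_k$ has $2^k$ cells, and empty intersections with $B$ are simply dropped.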

B = R and µ ≪ λ
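$\{B_k\} = \{C_k\}$
For each $y \in B$, there exist $K \in \mathbb{N}$ and a unique $\{(a_k, b_k]\}_{k=K}^{\infty}$ s.t.
$$\begin{cases} y \in (a_k, b_k] \in B_k, & k = K, K+1, \cdots \\ |a_k - b_k| \to 0, & k \to \infty \end{cases}$$
$F_Y$: the distribution function of $Y$
$$f_k(y) = \frac{P(Y \in (a_k, b_k])}{\lambda((a_k, b_k])} = \frac{F_Y(b_k) - F_Y(a_k)}{b_k - a_k} \to f(y), \quad y \in B$$
$h(f_k) \to h(f)$ as $k \to \infty$ for any $f$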
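The pointwise convergence $f_k(y) \to f(y)$ can be illustrated numerically. The sketch below assumes $Y$ is standard normal and uses $\mu = 0$, $\sigma = 1$ for the histogram sequence; it follows only the cell that contains $y$ instead of building all of $C_k$. The names `cell_of` and `f_k` are hypothetical, and a cell of infinite Lebesgue measure is given density 0.

```python
import math


def normal_cdf(t):
    """Distribution function F_Y of the standard normal."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def cell_of(y, k, mu=0.0, sigma=1.0):
    """The cell (a, b] of C_k containing y, obtained by applying the refinement rule."""
    a, b = -math.inf, math.inf
    for level in range(k):
        if level == 0:
            cut = mu                     # C_0 -> C_1 splits at mu
        elif a == -math.inf:
            cut = mu - level * sigma     # left tail
        elif b == math.inf:
            cut = mu + level * sigma     # right tail
        else:
            cut = (a + b) / 2            # bounded cell
        if y <= cut:
            b = cut
        else:
            a = cut
    return a, b


def f_k(y, k, cdf=normal_cdf):
    """(F_Y(b_k) - F_Y(a_k)) / (b_k - a_k) for the cell (a_k, b_k] of C_k containing y."""
    a, b = cell_of(y, k)
    if math.isinf(a) or math.isinf(b):
        return 0.0
    return (cdf(b) - cdf(a)) / (b - a)


# Usage: compare with the true density at y = 0.3.
y = 0.3
for k in (2, 5, 10, 25):
    print(k, f_k(y, k), math.exp(-y * y / 2) / math.sqrt(2 * math.pi))
```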

B = N and µ ≪ η
$B_0 = \{B\}$
$$B_1 := \{\{1\}, \{2, 3, \cdots\}\}$$
$$B_2 := \{\{1\}, \{2\}, \{3, 4, \cdots\}\}$$
$$\cdots$$
$$B_k := \{\{1\}, \{2\}, \cdots, \{k\}, \{k+1, k+2, \cdots\}\}$$
$$\cdots$$
can be obtained via $\mu = 1$, $\sigma = 1$.
For each $y \in B$, there exist $K \in \mathbb{N}$ and a unique $\{D_k\}_{k=1}^{\infty}$ s.t.
$$\begin{cases} y \in D_k \in B_k, & k = 1, 2, \cdots \\ \{y\} = D_k \in B_k, & k = K, K+1, \cdots \end{cases}$$
$$f_k(y) = \frac{P(Y \in D_k)}{\eta(D_k)} \to f(y) = \frac{P(Y = y)}{\eta(\{y\})}, \quad y \in B$$
$h(f_k) \to h(f)$ as $k \to \infty$ for any $f$

Result 2
Theorem 1
If $\mu \ll \eta$, then there exists $\nu \ll \eta$ s.t. $\nu^n(\mathbb{R}^n) \le 1$ and, for any $\mu$,
$$\frac{1}{n}\log\frac{d\mu^n}{d\nu^n}(y^n) \to 0$$
The proof is based on the following observation:
Billingsley: Probability & Measure, Problem 32.13
$$\lim_{h \to 0}\frac{\mu((x-h, x+h])}{\eta((x-h, x+h])} = f(x), \quad x \in \mathbb{R}$$
to remove the condition Ryabko posed:
"for any $\mu$ s.t. $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$"

Conclusion
Summary and Discussion
Universal Bayesian Measure
the random variables may be either discrete or continuous
a universal histogram sequence to remove Ryabko's condition
Many Applications
Bayesian network structure estimation (DCC 2012)
The Bayesian Chow-Liu Algorithm (PGM 2012)
Markov order estimation even when $\{X_i\}$ is continuous
Extending MDL:
$g^n(y^n|m)$: the universal Bayesian measure w.r.t. model $m$ given $y^n \in B^n$
$p_m$: the prior probability of model $m$
$$-\log g^n(y^n|m) - \log p_m \to \min$$
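As one possible instance of the criterion $-\log g^n(y^n|m) - \log p_m \to \min$, the sketch below estimates the Markov order of a binary sequence. It is an illustration only and not taken from the talk: model $m$ is the Markov order, $g^n(\cdot|m)$ is a Krichevsky-Trofimov mixture kept separately for each context of length $m$, the first $m$ symbols are charged 1 bit each, and the prior is $p_m \propto 2^{-(m+1)}$. The names `code_length` and `mdl_order` are hypothetical.

```python
import math
from collections import defaultdict


def code_length(xs, order):
    """-log2 g^n(x^n | m) in bits for a binary sequence and Markov order m = order."""
    counts = defaultdict(lambda: [0, 0])   # context -> counts of next symbol 0 / 1
    bits = float(order)                    # assumed cost of the first `order` symbols
    for t in range(order, len(xs)):
        c = counts[tuple(xs[t - order:t])]
        # KT predictive probability within the current context.
        bits -= math.log2((c[xs[t]] + 0.5) / (c[0] + c[1] + 1.0))
        c[xs[t]] += 1
    return bits


def mdl_order(xs, max_order=3):
    """argmin over m of -log2 g^n(x^n | m) - log2 p_m, with -log2 p_m = m + 1 + const."""
    return min(range(max_order + 1), key=lambda m: code_length(xs, m) + (m + 1))


# Usage: an alternating sequence and a constant sequence.
print(mdl_order([0, 1] * 100), mdl_order([0] * 200))
```

The normalising constant of the prior is the same for every $m$, so it can be dropped without changing the minimiser.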
