# WITMSE 2013

The MDL principle for arbitrary data: either discrete or continuous or none of them
Sanjo-Kaikan, University of Tokyo, Japan
August 26, 2013

### WITMSE 2013

1. The MDL principle for arbitrary data: either discrete or continuous or none of them. Joe Suzuki (Osaka University). WITMSE 2013, Sanjo-Kaikan, University of Tokyo, Japan, August 26, 2013.
2. Road Map
   1. Problem
   2. The Ryabko measure
   3. The Radon-Nikodym theorem
   4. Generalization
   5. Universal Histogram Sequence
   6. Conclusion
3. Road Map: the slides of this talk are available online (search keywords: Joe Suzuki slideshare) at http://www.slideshare.net/prof-joe/
4. Problem

   Given $\{(x_i, y_i)\}_{i=1}^n$, identify whether $X \perp\!\!\!\perp Y$ or not.

   Here $A$ and $B$ are finite sets, $x^n = (x_1, \cdots, x_n) \in A^n$, $y^n = (y_1, \cdots, y_n) \in B^n$, the probabilities $P^n(x^n|\theta)$, $P^n(y^n|\theta)$, $P^n(x^n, y^n|\theta)$ are expressed by a parameter $\theta$, and $p$ is the prior probability of $X \perp\!\!\!\perp Y$.

   Bayesian solution:

   $$X \perp\!\!\!\perp Y \iff p\,Q^n(x^n)\,Q^n(y^n) \ge (1 - p)\,Q^n(x^n, y^n)$$

   $$Q^n(x^n) := \int P^n(x^n|\theta)\,w(\theta)\,d\theta, \qquad Q^n(y^n) := \int P^n(y^n|\theta)\,w(\theta)\,d\theta$$

   $$Q^n(x^n, y^n) := \int P^n(x^n, y^n|\theta)\,w(\theta)\,d\theta$$

   using a weight $w$ over $\theta$.
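As a minimal sketch of the Bayesian test on this slide, the Python code below evaluates $p\,Q^n(x^n)Q^n(y^n) \ge (1-p)\,Q^n(x^n, y^n)$ in log space. It assumes i.i.d. observations and a symmetric Dirichlet weight $w$ with parameter $a = 1/2$, neither of which this slide fixes; the function names `log_q` and `is_independent` are hypothetical.

```python
from collections import Counter
from math import lgamma, log


def log_q(seq, alphabet_size, a=0.5):
    """log Q^n(seq) for i.i.d. data under a symmetric Dirichlet(a) weight (assumed)."""
    n = len(seq)
    total = lgamma(alphabet_size * a) - lgamma(n + alphabet_size * a)
    # Symbols that never occur contribute lgamma(a) - lgamma(a) = 0 and are skipped.
    for count in Counter(seq).values():
        total += lgamma(count + a) - lgamma(a)
    return total


def is_independent(xs, ys, size_a, size_b, p=0.5):
    """Decide X independent of Y iff p Q(x^n) Q(y^n) >= (1 - p) Q(x^n, y^n)."""
    pairs = list(zip(xs, ys))
    lhs = log(p) + log_q(xs, size_a) + log_q(ys, size_b)
    rhs = log(1 - p) + log_q(pairs, size_a * size_b)
    return lhs >= rhs


if __name__ == "__main__":
    xs = [0, 0, 1, 1] * 50
    ys = [0, 1, 0, 1] * 50
    print(is_independent(xs, ys, 2, 2))  # balanced, unrelated pattern
    print(is_independent(xs, xs, 2, 2))  # y is a copy of x
```

Working with logarithms avoids underflow, since $Q^n$ shrinks exponentially in $n$.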
5. Problem: Q should be an alternative to P as n grows

   $A$: the finite set in which $X$ takes values. $Q$ is a Bayesian measure. Kraft's inequality:

   $$\sum_{x^n \in A^n} Q^n(x^n) \le 1 \qquad (1)$$

   For example, $Q^n(x^n) = |A|^{-n}$, $x^n \in A^n$, satisfies (1) but does not converge to $P^n$ in any sense.
6. Problem: Universal Bayesian Measures

   $$Q^n(x^n) := \int P^n(x^n|\theta)\,w(\theta)\,d\theta, \qquad w(\theta) \propto \prod_{x \in A} \theta[x]^{-a[x]}$$

   with $\{a[x] = \tfrac{1}{2}\}_{x \in A}$ (Krichevsky-Trofimov). Then

   $$-\frac{1}{n} \log Q^n(x^n) \to H(P)$$

   for any $P^n(x^n|\theta) = \prod_{x \in A} \theta[x]^{c[x]}$, where $c[x]$ is the number of occurrences of $x \in A$ in $x^n \in A^n$.

   Shannon-McMillan-Breiman:

   $$-\frac{1}{n} \log P^n(x^n|\theta) \to H(P)$$

   for any stationary ergodic $P$, so that for $P^n(x^n) := P^n(x^n|\theta)$,

   $$\frac{1}{n} \log \frac{P^n(x^n)}{Q^n(x^n)} \to 0 \qquad (2)$$
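The convergence of the Krichevsky-Trofimov code length can be checked numerically. The sketch below uses the standard sequential form of the Dirichlet(1/2) mixture, in which the $i$-th symbol gets predictive probability $(c[x] + 1/2)/(i - 1 + |A|/2)$; that form is an assumption brought in for illustration and is not written on the slide. The function name `kt_code_length_bits` is hypothetical.

```python
from math import log2


def kt_code_length_bits(seq, alphabet):
    """Return -log2 Q^n(seq) for the Krichevsky-Trofimov measure over `alphabet`."""
    counts = {symbol: 0 for symbol in alphabet}
    bits = 0.0
    for i, symbol in enumerate(seq):
        # Predictive probability given the first i symbols.
        prob = (counts[symbol] + 0.5) / (i + 0.5 * len(alphabet))
        bits -= log2(prob)
        counts[symbol] += 1
    return bits


if __name__ == "__main__":
    seq = "ab" * 500  # empirical entropy: exactly 1 bit per symbol
    print(kt_code_length_bits(seq, "ab") / len(seq))
```

For this sequence the per-symbol code length exceeds 1 bit only by a term of order $(\log n)/n$, which is the redundancy that vanishes in (2).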
7. Problem: When X has a density function f

   There exists a $g$ such that

   $$\int_{x^n \in \mathbb{R}^n} g^n(x^n)\,dx^n \le 1 \qquad (3)$$

   $$\frac{1}{n} \log \frac{f^n(x^n)}{g^n(x^n)} \to 0 \qquad (4)$$

   for any $f$ satisfying a condition mentioned later (Ryabko 2009).
8. Problem: The problem in this paper

   Universal Bayesian measure in the general setting: what are (1)(2) and (3)(4) for general random variables?

   1. without assuming that the variables are either discrete or continuous
   2. removing the constraint that Ryabko imposes
9. The Ryabko measure: X has a density function f

   $A$: the set in which $X$ takes values. $\{A_j\}_{j=0}^{\infty}$ is a sequence of partitions with $A_0 := \{A\}$ and $A_{j+1}$ a refinement of $A_j$.

   For example, for $A = [0, 1)$:

   $$A_0 = \{[0, 1)\}$$

   $$A_1 = \{[0, 1/2), [1/2, 1)\}$$

   $$A_2 = \{[0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1)\}$$

   $$\cdots$$

   $$A_j = \{[0, 2^{-j}), [2^{-j}, 2 \cdot 2^{-j}), \cdots, [(2^{j} - 1)2^{-j}, 1)\}$$

   $s_j : A \to A_j$ maps each point to its cell: $x \in a \in A_j \implies s_j(x) = a$. With $\lambda$ the Lebesgue measure and $P_j(a) := P(X \in a)$ for $a \in A_j$,

   $$f_j(x) := \frac{P_j(s_j(x))}{\lambda(s_j(x))} \quad \text{for } x \in A$$
10. The Ryabko measure

    Given $x^n = (x_1, \cdots, x_n) \in A^n$ such that $(s_j(x_1), \cdots, s_j(x_n)) = (a_1, \cdots, a_n) \in A_j^n$,

    $$f_j^n(x^n) := f_j(x_1) \cdots f_j(x_n) = \frac{P_j(a_1) \cdots P_j(a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$$

    $$g_j^n(x^n) := \frac{Q_j^n(a_1, \cdots, a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$$

    where $Q_j$ is a universal Bayesian measure w.r.t. the finite set $A_j$. With

    $$f^n(x^n) := f(x_1) \cdots f(x_n), \qquad g^n(x^n) := \sum_{j=0}^{\infty} w_j\, g_j^n(x^n)$$

    for $\{w_j\}_{j=0}^{\infty}$ such that $\sum_j w_j = 1$, $w_j > 0$,

    $$\frac{1}{n} \log \frac{f^n(x^n)}{g^n(x^n)} \to 0$$

    for any $f$ such that the differential entropy $h(f_j) \to h(f)$ as $j \to \infty$ (Ryabko, 2009).
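The mixture $g^n$ can be sketched in a few lines for $A = [0, 1)$. The code below makes several assumptions that the slides do not fix: dyadic partitions with $2^j$ equal cells, the Krichevsky-Trofimov measure as $Q_j$, weights $w_j = 1/((j+1)(j+2))$ (which sum to 1 over $j \ge 0$), and a truncation of the infinite sum at `max_level`, which can only make the computed value smaller than $g^n$. The function names are hypothetical.

```python
from math import exp, log


def log_kt(cells, num_cells):
    """log Q_j^n of a cell-index sequence under the KT measure on num_cells symbols."""
    counts = {}
    total = 0.0
    for i, cell in enumerate(cells):
        total += log((counts.get(cell, 0) + 0.5) / (i + 0.5 * num_cells))
        counts[cell] = counts.get(cell, 0) + 1
    return total


def log_g(xs, max_level=20):
    """log of the truncated mixture sum_j w_j g_j^n(x^n) for points xs in [0, 1)."""
    n = len(xs)
    terms = []
    for j in range(max_level + 1):
        num_cells = 2 ** j
        cells = [int(x * num_cells) for x in xs]  # s_j(x): index of the cell holding x
        log_w = -log((j + 1) * (j + 2))           # assumed weights w_j
        # Every cell has Lebesgue measure 2^-j, so dividing by the product of
        # lambda(a_i) adds n * j * log 2.
        terms.append(log_w + log_kt(cells, num_cells) + n * j * log(2))
    top = max(terms)
    return top + log(sum(exp(t - top) for t in terms))  # log-sum-exp


if __name__ == "__main__":
    n = 256
    spread = [(i + 0.5) / n for i in range(n)]       # evenly spread over [0, 1)
    packed = [(i + 0.5) / (4 * n) for i in range(n)]  # evenly spread over [0, 1/4)
    print(log_g(spread) / n, log_g(packed) / n)
```

For points spread evenly over $[0, 1/4)$ the per-sample value of $\log g^n$ approaches $\log 4$, the log-density of the uniform distribution on that interval.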
11. The Radon-Nikodym theorem: In general, exactly when does a density function exist?

    $\mathcal{B}$: the collection of all Borel sets of $\mathbb{R}$. $\mu(D) := P(X \in D)$: the probability of $(X \in D)$ for $D \in \mathcal{B}$. $F_X$: the distribution function of $X$.

    $\mu$ is absolutely continuous w.r.t. $\lambda$ ($\mu \ll \lambda$). The following two are equivalent:

    (i) $f : \mathbb{R} \to \mathbb{R}$ exists such that $P(X \le x) = F_X(x) = \int_{t \le x} f(t)\,dt$

    (ii) for any $D \in \mathcal{B}$, $\lambda(D) := \int_D dx = 0 \implies \mu(D) = 0$.

    $$f(x) = \frac{dF_X(x)}{dx}$$
12. The Radon-Nikodym theorem: Even discrete variables have density functions!

    $B$: a countable subset of $\mathbb{R}$. $\mu(D) := P(X \in D)$: the probability of $(X \in D)$ for $D \subseteq B$. $r : B \to \mathbb{R}$.

    $\mu$ is absolutely continuous w.r.t. $\eta$ ($\mu \ll \eta$). The following two are equivalent:

    (i) $f : B \to \mathbb{R}$ exists such that $P(X \in D) = \sum_{x \in D} f(x)\,r(x)$, $D \subseteq B$

    (ii) for any $D \subseteq B$, $\eta(D) := \sum_{x \in D} r(x) = 0 \implies \mu(D) = 0$.

    $$f(x) = \frac{P(X = x)}{r(x)}$$
13. The Radon-Nikodym theorem: Radon-Nikodym

    $\mu$, $\eta$: $\sigma$-finite measures over a $\sigma$-field $\mathcal{F}$.

    $\mu$ is absolutely continuous w.r.t. $\eta$ ($\mu \ll \eta$). The following two are equivalent:

    (i) an $\mathcal{F}$-measurable $f$ exists such that for any $A \in \mathcal{F}$, $\mu(A) = \int_A f(t)\,d\eta(t)$

    (ii) for any $A \in \mathcal{F}$, $\eta(A) = 0 \implies \mu(A) = 0$

    $$\int_A f(t)\,d\eta(t) := \sup_{\{A_i\}} \sum_i \Big[\inf_{x \in A_i} f(x)\Big]\,\eta(A_i)$$

    $\dfrac{d\mu}{d\eta} := f$ is the density function w.r.t. $\eta$ when $\mu$ is a probability measure.
14. Generalization: When Y has a density function w.r.t. η s.t. µ ≪ η

    $B$: the set in which $Y$ takes values. $\{B_k\}_{k=0}^{\infty}$ is a sequence of partitions with $B_0 := \{B\}$ and $B_{k+1}$ a refinement of $B_k$.

    For example, for $B = \mathbb{N} := \{1, 2, \cdots\}$:

    $$B_0 = \{B\}$$

    $$B_1 := \{\{1\}, \{2, 3, \cdots\}\}$$

    $$B_2 := \{\{1\}, \{2\}, \{3, 4, \cdots\}\}$$

    $$\cdots$$

    $$B_k := \{\{1\}, \{2\}, \cdots, \{k\}, \{k + 1, k + 2, \cdots\}\}$$

    $t_k : B \to B_k$ maps each point to its cell: $y \in b \in B_k \implies t_k(y) = b$. With $\eta$ such that $\mu \ll \eta$,

    $$f_k(y) := \frac{P_k(t_k(y))}{\eta(t_k(y))} \quad \text{for } y \in B$$
15. Generalization

    Given $y^n = (y_1, \cdots, y_n) \in B^n$ such that $(t_k(y_1), \cdots, t_k(y_n)) = (b_1, \cdots, b_n) \in B_k^n$,

    $$f_k^n(y^n) := f_k(y_1) \cdots f_k(y_n) = \frac{P_k(b_1) \cdots P_k(b_n)}{\eta(b_1) \cdots \eta(b_n)}$$

    $$g_k^n(y^n) := \frac{Q_k^n(b_1, \cdots, b_n)}{\eta(b_1) \cdots \eta(b_n)}$$

    where $Q_k$ is a universal Bayesian measure w.r.t. the finite set $B_k$. Similarly, with $g^n(y^n)$ the weighted sum of the $g_k^n(y^n)$ over $k$,

    $$\frac{1}{n} \log \frac{f^n(y^n)}{g^n(y^n)} \to 0$$

    for any $f$ such that $h(f_k) \to h(f)$ as $k \to \infty$, where

    $$h(f) := \int -f(y) \log f(y)\,d\eta(y)$$
16. Generalization

    $$\mu^n(D^n) := \int_{D^n} f^n(y^n)\,d\eta^n(y^n), \quad D^n \in \mathcal{B}^n$$

    $$\nu^n(D^n) := \int_{D^n} g^n(y^n)\,d\eta^n(y^n), \quad D^n \in \mathcal{B}^n$$

    $$\frac{f^n(y^n)}{g^n(y^n)} = \frac{d\mu^n}{d\eta^n}(y^n) \Big/ \frac{d\nu^n}{d\eta^n}(y^n) = \frac{d\mu^n}{d\nu^n}(y^n)$$

    $$D(\mu\|\nu) := \int d\mu \log \frac{d\mu}{d\nu}$$

    $$h(f) := \int -f(y) \log f(y)\,d\eta(y) = -\int \frac{d\mu}{d\eta}(y) \log \frac{d\mu}{d\eta}(y) \cdot d\eta(y) = -D(\mu\|\eta)$$
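The identity $h(f) = -D(\mu\|\eta)$ is easy to verify in the discrete case, where $\eta(\{y\}) = r(y)$ and $f(y) = P(Y = y)/r(y)$. The sketch below is only an illustration on a three-point set; the dictionaries `p` and `r` and the function names are made up for the example, and $r(y) > 0$ is assumed wherever $P(Y = y) > 0$ (that is, $\mu \ll \eta$).

```python
from math import isclose, log


def entropy_wrt(p, r):
    """h(f) = sum over y of -f(y) log f(y) r(y), with f(y) = p(y) / r(y)."""
    total = 0.0
    for y, prob in p.items():
        if prob > 0:
            f = prob / r[y]
            total -= f * log(f) * r[y]
    return total


def kl_divergence(p, r):
    """D(mu || eta) = sum over y of p(y) log(p(y) / r(y))."""
    return sum(prob * log(prob / r[y]) for y, prob in p.items() if prob > 0)


if __name__ == "__main__":
    p = {1: 0.5, 2: 0.25, 3: 0.25}   # mu: the distribution of Y
    r = {1: 0.25, 2: 0.25, 3: 0.5}   # eta: the reference measure
    print(entropy_wrt(p, r), -kl_divergence(p, r))
    counting = {y: 1.0 for y in p}   # counting measure recovers Shannon entropy
    print(entropy_wrt(p, counting))
```

With the counting measure as $\eta$, $h(f)$ reduces to the Shannon entropy, which is why the same statement covers both discrete and continuous data.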
17. Generalization: Result 1

    Proposition 1 (Suzuki, 2011). If $\mu \ll \eta$, there exists $\nu \ll \eta$ such that $\nu^n(\mathbb{R}^n) \le 1$ and

    $$\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(y^n) \to 0$$

    for any $\mu$ such that $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$ (here $\mu_k$ is the measure whose density w.r.t. $\eta$ is $f_k$).
18. Generalization: The solution of the exercise in Introduction

    Using the product partitions $\{A_j \times B_k\}$,

    $$g_{j,k}^n(x^n, y^n) := \frac{Q_{j,k}^n(a_1, b_1, \cdots, a_n, b_n)}{\lambda(a_1) \cdots \lambda(a_n)\,\eta(b_1) \cdots \eta(b_n)}$$

    $$g^n(x^n, y^n) := \sum_{j,k} w_{j,k}\, g_{j,k}^n(x^n, y^n)$$

    for $\{w_{j,k}\}$ such that $\sum_{j,k} w_{j,k} = 1$, $w_{j,k} > 0$. Then

    $$\frac{1}{n} \log \frac{f^n(x^n, y^n)}{g^n(x^n, y^n)} \to 0$$

    Solution: we estimate $\dfrac{f^n(x^n, y^n)}{f^n(x^n)\,f^n(y^n)}$ by $\dfrac{g^n(x^n, y^n)}{g^n(x^n)\,g^n(y^n)}$, extending $\dfrac{Q^n(x^n, y^n)}{Q^n(x^n)\,Q^n(y^n)}$.
19. Generalization: Further generalization

    Proposition 1 assumes a specific histogram sequence $\{B_k\}$, and $\mu$ should satisfy $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$.

    $\{B_k\}$ should be universal: construct $\{B_k\}$ such that $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$ for any $\mu$.
20. Universal Histogram Sequence: Universal histogram sequence $\{B_k\}$

    Let $\mu, \sigma \in \mathbb{R}$, $\sigma > 0$, and define $\{C_k\}_{k=0}^{\infty}$ by

    $$C_0 = \{(-\infty, \infty)\}$$

    $$C_1 = \{(-\infty, \mu], (\mu, \infty)\}$$

    $$C_2 = \{(-\infty, \mu - \sigma], (\mu - \sigma, \mu], (\mu, \mu + \sigma], (\mu + \sigma, \infty)\}$$

    $$\cdots$$

    The step $C_k \to C_{k+1}$ splits every cell in two:

    $$(-\infty, \mu - (k - 1)\sigma] \to (-\infty, \mu - k\sigma],\ (\mu - k\sigma, \mu - (k - 1)\sigma]$$

    $$(a, b] \to \Big(a, \frac{a + b}{2}\Big],\ \Big(\frac{a + b}{2}, b\Big]$$

    $$(\mu + (k - 1)\sigma, \infty) \to (\mu + (k - 1)\sigma, \mu + k\sigma],\ (\mu + k\sigma, \infty)$$

    $B$: the set in which $Y$ takes values.

    $$B_k := \{B \cap c \mid c \in C_k\} \setminus \{\emptyset\}$$
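The refinement rule can be written out directly. In the sketch below, a half-open interval $(a, b]$ is stored as the tuple `(a, b)`, the step $C_0 \to C_1$ is handled as a special case, and $B$ is passed as a finite collection of points (so an infinite set such as $\mathbb{N}$ is truncated for illustration). These representation choices and the function names are assumptions made for the example.

```python
from math import inf


def next_partition(cells, k, mu, sigma):
    """One refinement step C_k -> C_{k+1}; each tuple (a, b) stands for (a, b]."""
    if k == 0:
        return [(-inf, mu), (mu, inf)]
    refined = []
    for a, b in cells:
        if a == -inf:    # left tail: cut at mu - k * sigma
            refined += [(-inf, mu - k * sigma), (mu - k * sigma, b)]
        elif b == inf:   # right tail: cut at mu + k * sigma
            refined += [(a, mu + k * sigma), (mu + k * sigma, inf)]
        else:            # bounded cell: cut at the midpoint
            mid = (a + b) / 2
            refined += [(a, mid), (mid, b)]
    return refined


def partition(k, mu, sigma):
    """C_k, built from C_0 = {(-inf, inf)} by k refinement steps."""
    cells = [(-inf, inf)]
    for step in range(k):
        cells = next_partition(cells, step, mu, sigma)
    return cells


def restrict(cells, points):
    """B_k = {B intersect c : c in C_k} without the empty set, for a finite B."""
    blocks = [[y for y in points if a < y <= b] for a, b in cells]
    return [block for block in blocks if block]


if __name__ == "__main__":
    print(partition(2, 0.0, 1.0))
    print(restrict(partition(3, 1.0, 1.0), range(1, 11)))
```

With $\mu = 1$, $\sigma = 1$ and $B = \{1, \ldots, 10\}$, the third level yields the blocks $\{1\}, \{2\}, \{3\}, \{4, \ldots, 10\}$, matching the partition of $\mathbb{N}$ used earlier in the talk.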
21. Universal Histogram Sequence: $B = \mathbb{R}$ and $\mu \ll \lambda$

    Here $\{B_k\} = \{C_k\}$. For each $y \in B$, there exist $K \in \mathbb{N}$ and a unique $\{(a_k, b_k]\}_{k=K}^{\infty}$ such that

    $$y \in (a_k, b_k] \in B_k, \quad k = K, K + 1, \cdots$$

    $$|a_k - b_k| \to 0, \quad k \to \infty$$

    With $F_Y$ the distribution function of $Y$,

    $$f_k(y) = \frac{P(Y \in (a_k, b_k])}{\lambda((a_k, b_k])} = \frac{F_Y(b_k) - F_Y(a_k)}{b_k - a_k} \to f(y), \quad y \in B$$

    $$h(f_k) \to h(f) \text{ as } k \to \infty \text{ for any } f$$
22. Universal Histogram Sequence: $B = \mathbb{N}$ and $\mu \ll \eta$

    $$B_0 = \{B\}$$

    $$B_1 := \{\{1\}, \{2, 3, \cdots\}\}$$

    $$B_2 := \{\{1\}, \{2\}, \{3, 4, \cdots\}\}$$

    $$\cdots$$

    $$B_k := \{\{1\}, \{2\}, \cdots, \{k\}, \{k + 1, k + 2, \cdots\}\}$$

    can be obtained via $\mu = 1$, $\sigma = 1$. For each $y \in B$, there exist $K \in \mathbb{N}$ and a unique $\{D_k\}_{k=1}^{\infty}$ such that

    $$y \in D_k \in B_k, \quad k = 1, 2, \cdots$$

    $$\{y\} = D_k \in B_k, \quad k = K, K + 1, \cdots$$

    $$f_k(y) = \frac{P(Y \in D_k)}{\eta(D_k)} \to f(y) = \frac{P(Y = y)}{\eta(\{y\})}, \quad y \in B$$

    $$h(f_k) \to h(f) \text{ as } k \to \infty \text{ for any } f$$
23. Universal Histogram Sequence: Result 2

    Theorem 1. If $\mu \ll \eta$, there exists $\nu \ll \eta$ such that $\nu^n(\mathbb{R}^n) \le 1$ and, for any $\mu$,

    $$\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(y^n) \to 0$$

    The proof is based on the following observation (Billingsley, Probability and Measure, Problem 32.13):

    $$\lim_{h \to 0} \frac{\mu((x - h, x + h])}{\eta((x - h, x + h])} = f(x), \quad x \in \mathbb{R}$$

    which removes the condition Ryabko posed: "for any $\mu$ s.t. $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$".
24. Conclusion: Summary and Discussion

    Universal Bayesian Measure: the random variables may be either discrete or continuous; a universal histogram sequence removes Ryabko's condition.

    Many Applications: Bayesian network structure estimation (DCC 2012); the Bayesian Chow-Liu Algorithm (PGM 2012); Markov order estimation even when $\{X_i\}$ is continuous.

    Extending MDL: with $g^n(y^n|m)$ the universal Bayesian measure w.r.t. model $m$ given $y^n \in B^n$, and $p_m$ the prior probability of model $m$,

    $$-\log g^n(y^n|m) - \log p_m \to \min$$
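To make the criterion $-\log g^n(y^n|m) - \log p_m \to \min$ concrete, here is a rough sketch of Markov order estimation in the simplest discrete case. Everything specific in it is an assumption for illustration and not the method of the cited papers: $g^n(y^n|m)$ is taken to be a product of Krichevsky-Trofimov measures, one per context of length $m$; the first `max_order` symbols are treated as given so that every model encodes the same symbols; and the prior is $p_m \propto 2^{-m}$.

```python
from math import log2


def kt_bits(symbols, alphabet_size):
    """-log2 of the Krichevsky-Trofimov measure of a symbol list."""
    counts = {}
    bits = 0.0
    for i, s in enumerate(symbols):
        bits -= log2((counts.get(s, 0) + 0.5) / (i + 0.5 * alphabet_size))
        counts[s] = counts.get(s, 0) + 1
    return bits


def order_bits(seq, order, max_order, alphabet_size):
    """-log2 g^n(y^n | m): one KT measure per context of length `order` (assumed)."""
    by_context = {}
    for i in range(max_order, len(seq)):
        context = tuple(seq[i - order:i])
        by_context.setdefault(context, []).append(seq[i])
    return sum(kt_bits(v, alphabet_size) for v in by_context.values())


def mdl_order(seq, max_order, alphabet_size):
    """Return the order m minimising -log2 g^n(y^n | m) - log2 p_m."""
    norm = sum(2.0 ** -m for m in range(max_order + 1))

    def description_length(m):
        prior = 2.0 ** -m / norm  # hypothetical prior p_m
        return order_bits(seq, m, max_order, alphabet_size) - log2(prior)

    return min(range(max_order + 1), key=description_length)


if __name__ == "__main__":
    print(mdl_order([0, 1] * 100, max_order=2, alphabet_size=2))
    print(mdl_order([0] * 50, max_order=2, alphabet_size=2))
```

For continuous $\{X_i\}$ the same selection rule would apply with $g^n$ built from histogram sequences in place of the plain KT measure.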