
WITMSE 2013



The MDL principle for arbitrary data: either discrete or continuous or none of them
Sanjo-Kaikan, University of Tokyo, Japan
August 26, 2013

Published in: Technology, Education


  1. The MDL principle for arbitrary data: either discrete or continuous or none of them
     Joe Suzuki, Osaka University
     WITMSE 2013, Sanjo-Kaikan, University of Tokyo, Japan, August 26, 2013
  2. Road Map
     1. Problem
     2. The Ryabko measure
     3. The Radon-Nikodym theorem
     4. Generalization
     5. Universal Histogram Sequence
     6. Conclusion
  3. Road Map
     The slides of this talk are available online (search keywords: Joe Suzuki slideshare): http://www.slideshare.net/prof-joe/
  4. Problem
     Given $\{(x_i, y_i)\}_{i=1}^n$, identify whether $X \perp\!\!\!\perp Y$ or not.
     $A, B$: finite sets
     $x^n = (x_1, \dots, x_n) \in A^n$, $y^n = (y_1, \dots, y_n) \in B^n$
     $P^n(x^n|\theta)$, $P^n(y^n|\theta)$, $P^n(x^n, y^n|\theta)$: expressed by a parameter $\theta$
     $p$: the prior probability of $X \perp\!\!\!\perp Y$
     Bayesian solution:
     $X \perp\!\!\!\perp Y \iff p\,Q^n(x^n)\,Q^n(y^n) \ge (1-p)\,Q^n(x^n, y^n)$
     $Q^n(x^n) := \int P^n(x^n|\theta)\,w(\theta)\,d\theta$, $Q^n(y^n) := \int P^n(y^n|\theta)\,w(\theta)\,d\theta$,
     $Q^n(x^n, y^n) := \int P^n(x^n, y^n|\theta)\,w(\theta)\,d\theta$,
     using a weight $w$ over $\theta$.
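The decision rule on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the weight $w$ is taken to be the Dirichlet(1/2) (Krichevsky-Trofimov) weight, the symbols are arbitrary hashable values, and the names `log_kt` and `looks_independent` are hypothetical, not from the talk.

```python
import math
from collections import Counter

def log_kt(seq, alphabet_size):
    """Natural log of the Dirichlet(1/2) mixture Q^n(seq), built sequentially.

    Each factor is the predictive probability (count + 1/2) / (i + |A|/2).
    """
    counts = Counter()
    total = 0.0
    for i, symbol in enumerate(seq):
        total += math.log((counts[symbol] + 0.5) / (i + alphabet_size / 2))
        counts[symbol] += 1
    return total

def looks_independent(xs, ys, size_a, size_b, p=0.5):
    """Return True when p Q(x^n) Q(y^n) >= (1 - p) Q(x^n, y^n), in log space."""
    log_indep = math.log(p) + log_kt(xs, size_a) + log_kt(ys, size_b)
    pairs = list(zip(xs, ys))  # the joint sequence over the alphabet A x B
    log_joint = math.log(1 - p) + log_kt(pairs, size_a * size_b)
    return log_indep >= log_joint

if __name__ == "__main__":
    xs = [0, 0, 1, 1] * 25
    print(looks_independent(xs, [0, 1, 0, 1] * 25, 2, 2))  # balanced pairs
    print(looks_independent(xs, xs, 2, 2))                 # y copies x
```

Working with logarithms avoids underflow, since $Q^n$ shrinks exponentially in $n$.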
  5. Problem
     $Q$ should be an alternative to $P$ as $n$ grows.
     $A$: the finite set in which $X$ takes values.
     $Q$ is a Bayesian measure if it satisfies Kraft's inequality:
     $\sum_{x^n \in A^n} Q^n(x^n) \le 1$  (1)
     For example, $Q^n(x^n) = |A|^{-n}$, $x^n \in A^n$, satisfies (1) but does not converge to $P^n$ in any sense.
  6. Problem
     Universal Bayesian Measures
     $Q^n(x^n) := \int P^n(x^n|\theta)\,w(\theta)\,d\theta$, with $w(\theta) \propto \prod_{x \in A} \theta[x]^{-a[x]}$ and $\{a[x] = \frac{1}{2}\}_{x \in A}$ (Krichevsky-Trofimov).
     $-\frac{1}{n} \log Q^n(x^n) \to H(P)$ for any $P^n(x^n|\theta) = \prod_{x \in A} \theta[x]^{c[x]}$, where $c[x]$ is the number of occurrences of $x$ in $x^n \in A^n$.
     Shannon-McMillan-Breiman: $-\frac{1}{n} \log P^n(x^n|\theta) \to H(P)$ for any stationary ergodic $P$, so that for $P^n(x^n) := P^n(x^n|\theta)$,
     $\frac{1}{n} \log \frac{P^n(x^n)}{Q^n(x^n)} \to 0$.  (2)
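To make the limit $-\frac{1}{n}\log Q^n(x^n) \to H(P)$ concrete, here is a small numerical check. It is a sketch under assumptions: it uses the closed form of the Dirichlet(1/2) mixture via log-gamma functions, compares against the empirical entropy of a fixed sequence rather than a random sample, and the function names are hypothetical.

```python
import math
from collections import Counter

def log_kt_closed_form(seq, alphabet_size):
    """Natural log of the Dirichlet(1/2) mixture over an alphabet of size k.

    Q^n = Gamma(k/2) / Gamma(n + k/2) * prod_x Gamma(c[x] + 1/2) / Gamma(1/2);
    symbols with c[x] = 0 contribute a factor of 1 and are skipped.
    """
    n, k = len(seq), alphabet_size
    value = math.lgamma(k / 2) - math.lgamma(n + k / 2)
    for c in Counter(seq).values():
        value += math.lgamma(c + 0.5) - math.lgamma(0.5)
    return value

def empirical_entropy(seq):
    """Entropy (in nats) of the empirical distribution of seq."""
    n = len(seq)
    return -sum((c / n) * math.log(c / n) for c in Counter(seq).values())

if __name__ == "__main__":
    for reps in (10, 100, 1000):
        seq = [0, 0, 0, 1] * reps
        gap = -log_kt_closed_form(seq, 2) / len(seq) - empirical_entropy(seq)
        print(len(seq), gap)  # the per-symbol gap shrinks as n grows
```

The gap is nonnegative because a mixture of i.i.d. likelihoods can never exceed the maximum likelihood, and it decays roughly like $\frac{\log n}{2n}$ for a binary alphabet.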
  7. Problem
     When $X$ has a density function $f$, there exists a $g$ such that
     $\int_{x^n \in \mathbb{R}^n} g^n(x^n)\,dx^n \le 1$  (3)
     $\frac{1}{n} \log \frac{f^n(x^n)}{g^n(x^n)} \to 0$  (4)
     for any $f$ satisfying a condition mentioned later (Ryabko 2009).
  8. Problem
     The problem in this paper: a universal Bayesian measure in the general setting.
     What are (1)(2) and (3)(4) for general random variables?
     1. without assuming that the variable is either discrete or continuous
     2. removing the constraint that Ryabko poses
  9. The Ryabko measure
     Ryabko measure: $X$ has a density function $f$.
     $A$: the set in which $X$ takes values.
     $\{A_j\}_{j=0}^\infty$: $A_0 := \{A\}$, and $A_{j+1}$ is a refinement of $A_j$.
     For example, for $A = [0, 1)$:
     $A_0 = \{[0, 1)\}$
     $A_1 = \{[0, 1/2), [1/2, 1)\}$
     $A_2 = \{[0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1)\}$
     ...
     $A_j = \{[0, 2^{-j}), [2^{-j}, 2 \cdot 2^{-j}), \dots, [(2^j - 1)2^{-j}, 1)\}$
     ...
     $s_j : A \to A_j$: $x \in a \in A_j \implies s_j(x) = a$
     $\lambda$: the Lebesgue measure
     $f_j(x) := \frac{P_j(s_j(x))}{\lambda(s_j(x))}$ for $x \in A$
  10. The Ryabko measure
     Given $x^n = (x_1, \dots, x_n) \in A^n$ such that $(s_j(x_1), \dots, s_j(x_n)) = (a_1, \dots, a_n) \in A_j^n$,
     $f_j^n(x^n) := f_j(x_1) \cdots f_j(x_n) = \frac{P_j(a_1) \cdots P_j(a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$
     $g_j^n(x^n) := \frac{Q_j^n(a_1, \dots, a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$
     $Q_j$: a universal Bayesian measure w.r.t. the finite set $A_j$.
     $f^n(x^n) := f(x_1) \cdots f(x_n)$
     $g^n(x^n) := \sum_{j=0}^\infty \omega_j\, g_j^n(x^n)$ for $\{\omega_j\}_{j=0}^\infty$ such that $\sum_j \omega_j = 1$, $\omega_j > 0$
     $\frac{1}{n} \log \frac{f^n(x^n)}{g^n(x^n)} \to 0$ for any $f$ such that the differential entropy satisfies $h(f_j) \to h(f)$ as $j \to \infty$ (Ryabko, 2009).
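The construction of $g^n$ can be sketched numerically for $A = [0, 1)$. The sketch below makes several assumptions that are not in the slides: dyadic partitions with $2^j$ cells at level $j$, the Krichevsky-Trofimov mixture as $Q_j$, weights $\omega_j = 2^{-(j+1)}$, and a truncation of the infinite sum at `max_level` (which only makes $g^n$ smaller, so the integral stays at most 1). All function names are hypothetical.

```python
import math
from collections import Counter

def log_kt(seq, alphabet_size):
    """Natural log of the Dirichlet(1/2) mixture Q^n(seq)."""
    counts = Counter()
    total = 0.0
    for i, symbol in enumerate(seq):
        total += math.log((counts[symbol] + 0.5) / (i + alphabet_size / 2))
        counts[symbol] += 1
    return total

def log_ryabko_density(xs, max_level=10):
    """log g^n(x^n) for points in [0, 1), truncated at max_level (assumption)."""
    n = len(xs)
    terms = []
    for j in range(max_level + 1):
        num_cells = 2 ** j
        # s_j(x): index of the dyadic cell of width 2^-j that contains x
        cells = [min(int(x * num_cells), num_cells - 1) for x in xs]
        # g_j^n = Q_j^n(a_1..a_n) / prod lambda(a_i), with lambda(a_i) = 2^-j
        log_gj = log_kt(cells, num_cells) + n * j * math.log(2)
        log_weight = -(j + 1) * math.log(2)  # omega_j = 2^-(j+1)
        terms.append(log_weight + log_gj)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp

if __name__ == "__main__":
    spread = [(i + 0.5) / 64 for i in range(64)]
    packed = [0.1 + 0.001 * i for i in range(50)]
    print(log_ryabko_density(spread), log_ryabko_density(packed))
```

Data concentrated in a small interval receives a much larger density value than evenly spread data, because the finer levels of the mixture pick up the concentration.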
  11. The Radon-Nikodym theorem
     In general, exactly when does a density function exist?
     $\mathcal{B}$: the set of all Borel sets of $\mathbb{R}$
     $\mu(D) := P(X \in D)$: the probability of $(X \in D)$ for $D \in \mathcal{B}$
     $F_X$: the distribution function of $X$
     $\mu$ is absolutely continuous w.r.t. $\lambda$ ($\mu \ll \lambda$). The following two are equivalent:
     1. $f : \mathbb{R} \to \mathbb{R}$ exists such that $P(X \le x) = F_X(x) = \int_{t \le x} f(t)\,dt$
     2. for any $D \in \mathcal{B}$, $\lambda(D) := \int_D dx = 0 \implies \mu(D) = 0$.
     $f(x) = \frac{dF_X(x)}{dx}$
  12. The Radon-Nikodym theorem
     Even discrete variables have density functions!
     $B$: a countable subset of $\mathbb{R}$
     $\mu(D) := P(X \in D)$: the probability of $(X \in D)$ for $D \subseteq B$
     $r : B \to \mathbb{R}$
     $\mu$ is absolutely continuous w.r.t. $\eta$ ($\mu \ll \eta$). The following two are equivalent:
     1. $f : B \to \mathbb{R}$ exists such that $P(X \in D) = \sum_{x \in D} f(x)\,r(x)$, $D \subseteq B$
     2. for any $D \subseteq B$, $\eta(D) := \sum_{x \in D} r(x) = 0 \implies \mu(D) = 0$.
     $f(x) = \frac{P(X = x)}{r(x)}$
  13. The Radon-Nikodym theorem
     Radon-Nikodym
     $\mu, \eta$: $\sigma$-finite measures over a $\sigma$-field $\mathcal{F}$
     $\mu$ is absolutely continuous w.r.t. $\eta$ ($\mu \ll \eta$). The following two are equivalent:
     1. an $\mathcal{F}$-measurable $f$ exists such that for any $A \in \mathcal{F}$, $\mu(A) = \int_A f(t)\,d\eta(t)$
     2. for any $A \in \mathcal{F}$, $\eta(A) = 0 \implies \mu(A) = 0$
     $\int_A f(t)\,d\eta(t) := \sup_{\{A_i\}} \sum_i \left[\inf_{x \in A_i} f(x)\right] \eta(A_i)$
     $\frac{d\mu}{d\eta} := f$ is the density function w.r.t. $\eta$ when $\mu$ is the probability measure.
  14. Generalization
     When $Y$ has a density function w.r.t. $\eta$ such that $\mu \ll \eta$
     $B$: the set in which $Y$ takes values.
     $\{B_k\}_{k=0}^\infty$: $B_0 := \{B\}$, and $B_{k+1}$ is a refinement of $B_k$.
     For example, for $B = \mathbb{N} := \{1, 2, \dots\}$:
     $B_0 = \{B\}$
     $B_1 := \{\{1\}, \{2, 3, \dots\}\}$
     $B_2 := \{\{1\}, \{2\}, \{3, 4, \dots\}\}$
     ...
     $B_k := \{\{1\}, \{2\}, \dots, \{k\}, \{k+1, k+2, \dots\}\}$
     ...
     $t_k : B \to B_k$: $y \in b \in B_k \implies t_k(y) = b$
     $\eta$: a measure such that $\mu \ll \eta$
     $f_k(y) := \frac{P_k(t_k(y))}{\eta(t_k(y))}$ for $y \in B$
  15. Generalization
     Given $y^n = (y_1, \dots, y_n) \in B^n$ such that $(t_k(y_1), \dots, t_k(y_n)) = (b_1, \dots, b_n) \in B_k^n$,
     $f_k^n(y^n) := f_k(y_1) \cdots f_k(y_n) = \frac{P_k(b_1) \cdots P_k(b_n)}{\eta(b_1) \cdots \eta(b_n)}$
     $g_k^n(y^n) := \frac{Q_k^n(b_1, \dots, b_n)}{\eta(b_1) \cdots \eta(b_n)}$
     $Q_k$: a universal Bayesian measure w.r.t. the finite set $B_k$
     Similarly, $\frac{1}{n} \log \frac{f^n(y^n)}{g^n(y^n)} \to 0$ for any $f$ such that $h(f_k) \to h(f)$ as $k \to \infty$, where
     $h(f) := \int -f(y) \log f(y)\,d\eta(y)$
  16. Generalization
     $\mu^n(D^n) := \int_{D^n} f^n(y^n)\,d\eta^n(y^n)$, $D^n \in \mathcal{B}^n$
     $\nu^n(D^n) := \int_{D^n} g^n(y^n)\,d\eta^n(y^n)$, $D^n \in \mathcal{B}^n$
     $\frac{f^n(y^n)}{g^n(y^n)} = \frac{d\mu^n}{d\eta^n}(y^n) \Big/ \frac{d\nu^n}{d\eta^n}(y^n) = \frac{d\mu^n}{d\nu^n}(y^n)$
     $D(\mu\|\nu) := \int d\mu \log \frac{d\mu}{d\nu}$
     $h(f) := \int -f(y) \log f(y)\,d\eta(y) = -\int \frac{d\mu}{d\eta}(y) \log \frac{d\mu}{d\eta}(y)\,d\eta(y) = -D(\mu\|\eta)$
  17. Generalization
     Result 1
     Proposition 1 (Suzuki, 2011). If $\mu \ll \eta$, then there exists $\nu \ll \eta$ such that $\nu^n(\mathbb{R}^n) \le 1$ and
     $\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(y^n) \to 0$
     for any $\mu$ such that $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$.
  18. Generalization
     The solution of the exercise in the Introduction
     Use the product partitions $\{A_j \times B_k\}$:
     $g_{j,k}^n(x^n, y^n) := \frac{Q_{j,k}^n(a_1, b_1, \dots, a_n, b_n)}{\lambda(a_1) \cdots \lambda(a_n)\,\eta(b_1) \cdots \eta(b_n)}$
     $g^n(x^n, y^n) := \sum_{j,k} \omega_{j,k}\, g_{j,k}^n(x^n, y^n)$ for $\{\omega_{j,k}\}$ such that $\sum_{j,k} \omega_{j,k} = 1$, $\omega_{j,k} > 0$
     $\frac{1}{n} \log \frac{f^n(x^n, y^n)}{g^n(x^n, y^n)} \to 0$
     Solution: we estimate $\frac{f^n(x^n, y^n)}{f^n(x^n)\,f^n(y^n)}$ by $\frac{g^n(x^n, y^n)}{g^n(x^n)\,g^n(y^n)}$, extending $\frac{Q^n(x^n, y^n)}{Q^n(x^n)\,Q^n(y^n)}$.
  19. Generalization
     Further generalization
     Proposition 1 assumes a specific histogram sequence $\{B_k\}$, and $\mu$ should satisfy $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$.
     $\{B_k\}$ should be universal: construct $\{B_k\}$ such that $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$ for any $\mu$.
  20. Universal Histogram Sequence
     Universal histogram sequence $\{B_k\}$
     $\mu, \sigma \in \mathbb{R}$, $\sigma > 0$.
     $\{C_k\}_{k=0}^\infty$:
     $C_0 = \{(-\infty, \infty)\}$
     $C_1 = \{(-\infty, \mu], (\mu, \infty)\}$
     $C_2 = \{(-\infty, \mu - \sigma], (\mu - \sigma, \mu], (\mu, \mu + \sigma], (\mu + \sigma, \infty)\}$
     ...
     $C_k \to C_{k+1}$:
     $(-\infty, \mu - (k-1)\sigma] \to (-\infty, \mu - k\sigma], (\mu - k\sigma, \mu - (k-1)\sigma]$
     $(a, b] \to (a, \frac{a+b}{2}], (\frac{a+b}{2}, b]$
     $(\mu + (k-1)\sigma, \infty) \to (\mu + (k-1)\sigma, \mu + k\sigma], (\mu + k\sigma, \infty)$
     $B$: the set in which $Y$ takes values
     $B_k := \{B \cap c \mid c \in C_k\} \setminus \{\emptyset\}$
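The refinement rule $C_k \to C_{k+1}$ is mechanical enough to sketch in code. In the sketch below, a cell $(a, b]$ is represented as a tuple `(a, b)`, infinite endpoints use `math.inf`, and `restrict` intersects the cells with a finite list of support points, which is an assumption made so that the example is computable (the slides allow an infinite $B$). The function names are hypothetical.

```python
import math

def refine(cells, mu, sigma):
    """Apply one step C_k -> C_{k+1} to a list of half-open cells (a, b]."""
    out = []
    for a, b in cells:
        if a == -math.inf and b == math.inf:
            out += [(a, mu), (mu, b)]                # C_0 -> C_1: split at mu
        elif a == -math.inf:
            out += [(a, b - sigma), (b - sigma, b)]  # left tail moves out by sigma
        elif b == math.inf:
            out += [(a, a + sigma), (a + sigma, b)]  # right tail moves out by sigma
        else:
            mid = (a + b) / 2
            out += [(a, mid), (mid, b)]              # bounded cell: halve it
    return out

def histogram_level(k, mu, sigma):
    """Return C_k as a sorted list of (a, b) tuples."""
    cells = [(-math.inf, math.inf)]
    for _ in range(k):
        cells = refine(cells, mu, sigma)
    return cells

def restrict(cells, support):
    """B_k: intersect each cell with the support and drop empty intersections."""
    parts = [[y for y in support if a < y <= b] for a, b in cells]
    return [part for part in parts if part]

if __name__ == "__main__":
    print(histogram_level(2, 0.0, 1.0))
    print(restrict(histogram_level(2, 1.0, 1.0), range(1, 8)))
```

With $\mu = 1$, $\sigma = 1$ and the support $\{1, \dots, 7\}$, level 2 yields the cells $\{1\}$, $\{2\}$, $\{3, \dots, 7\}$, matching the pattern of the $B = \mathbb{N}$ example.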
  21. Universal Histogram Sequence
     $B = \mathbb{R}$ and $\mu \ll \lambda$
     $\{B_k\} = \{C_k\}$
     For each $y \in B$, there exist $K \in \mathbb{N}$ and a unique $\{(a_k, b_k]\}_{k=K}^\infty$ such that
     $y \in (a_k, b_k] \in B_k$, $k = K, K+1, \dots$
     $|a_k - b_k| \to 0$, $k \to \infty$
     $F_Y$: the distribution function of $Y$
     $f_k(y) = \frac{P(Y \in (a_k, b_k])}{\lambda((a_k, b_k])} = \frac{F_Y(b_k) - F_Y(a_k)}{b_k - a_k} \to f(y)$, $y \in B$
     $h(f_k) \to h(f)$ as $k \to \infty$ for any $f$
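The pointwise convergence $f_k(y) \to f(y)$ can be checked numerically. The sketch below assumes a standard normal $Y$ (so that $F_Y$ is available through `math.erf`) and the histogram parameters $\mu = 0$, $\sigma = 1$; both choices, and the function names, are illustrative assumptions. Unbounded tail cells have infinite Lebesgue measure, so the histogram density there is taken to be 0.

```python
import math

def normal_cdf(x):
    """Distribution function of the standard normal (assumed example for F_Y)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cell_containing(y, k, mu, sigma):
    """Follow y through k refinement steps and return its cell (a, b] in C_k."""
    a, b = -math.inf, math.inf
    for _ in range(k):
        if a == -math.inf and b == math.inf:
            split = mu
        elif a == -math.inf:
            split = b - sigma
        elif b == math.inf:
            split = a + sigma
        else:
            split = (a + b) / 2
        # cells are half-open (a, b], so y == split belongs to the left child
        a, b = (a, split) if y <= split else (split, b)
    return a, b

def histogram_density(y, k, cdf, mu=0.0, sigma=1.0):
    """f_k(y) = (F(b_k) - F(a_k)) / (b_k - a_k); 0 on unbounded tail cells."""
    a, b = cell_containing(y, k, mu, sigma)
    if math.isinf(a) or math.isinf(b):
        return 0.0
    return (cdf(b) - cdf(a)) / (b - a)

if __name__ == "__main__":
    target = math.exp(-0.3 ** 2 / 2) / math.sqrt(2 * math.pi)
    for k in (2, 5, 10, 20):
        print(k, histogram_density(0.3, k, normal_cdf), target)
```

Once $y$ falls into a bounded cell, each further level halves the cell width, so the difference quotient of $F_Y$ approaches the density at $y$.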
  22. Universal Histogram Sequence
     $B = \mathbb{N}$ and $\mu \ll \eta$
     $B_0 = \{B\}$
     $B_1 := \{\{1\}, \{2, 3, \dots\}\}$
     $B_2 := \{\{1\}, \{2\}, \{3, 4, \dots\}\}$
     ...
     $B_k := \{\{1\}, \{2\}, \dots, \{k\}, \{k+1, k+2, \dots\}\}$
     ...
     can be obtained via $\mu = 1$, $\sigma = 1$.
     For each $y \in B$, there exist $K \in \mathbb{N}$ and a unique $\{D_k\}_{k=1}^\infty$ such that
     $y \in D_k \in B_k$, $k = 1, 2, \dots$
     $\{y\} = D_k \in B_k$, $k = K, K+1, \dots$
     $f_k(y) = \frac{P(Y \in D_k)}{\eta(D_k)} \to f(y) = \frac{P(Y = y)}{\eta(\{y\})}$, $y \in B$
     $h(f_k) \to h(f)$ as $k \to \infty$ for any $f$
  23. Universal Histogram Sequence
     Result 2
     Theorem 1. If $\mu \ll \eta$, then there exists $\nu \ll \eta$ such that $\nu^n(\mathbb{R}^n) \le 1$ and, for any $\mu$,
     $\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(y^n) \to 0$
     The proof is based on the following observation (Billingsley: Probability & Measure, Problem 32.13):
     $\lim_{h \to 0} \frac{\mu((x - h, x + h])}{\eta((x - h, x + h])} = f(x)$, $x \in \mathbb{R}$
     which removes the condition Ryabko posed: "for any $\mu$ such that $D(\mu_k\|\eta) \to D(\mu\|\eta)$ as $k \to \infty$".
  24. Conclusion
     Summary and Discussion
     Universal Bayesian Measure
       • the random variables may be either discrete or continuous
       • a universal histogram sequence to remove Ryabko's condition
     Many Applications
       • Bayesian network structure estimation (DCC 2012)
       • The Bayesian Chow-Liu Algorithm (PGM 2012)
       • Markov order estimation even when $\{X_i\}$ is continuous
     Extending MDL:
     $g^n(y^n|m)$: the universal Bayesian measure w.r.t. model $m$ given $y^n \in B^n$
     $p_m$: the prior probability of model $m$
     $-\log g^n(y^n|m) - \log p_m \to \min$
