- 2. Outline: 1. density estimation, 2. entropy estimation, 3. experiments
- 3. Information and entropy: the information content of an outcome x under a density f is I_f(x) = −ln f(x), and the (differential) entropy of f is H(f) = −∫ f(x) ln f(x) dx; for a random variable X with density f we write H(X) = H(f). Besides the Shannon entropy there are generalizations: the Rényi entropy (1 − α)^{-1} log ∫ f(x)^α dx and the Tsallis entropy (q − 1)^{-1} (1 − ∫ f^q(x) dx), which recover Shannon entropy in the limits α → 1 and q → 1.
- 4. Related quantities: the cross entropy H(f, g) = E_f[I_g(X)] = −∫ f(x) ln g(x) dx and the entropy H(f) = E_f[I_f(X)] = −∫ f(x) ln f(x) dx; the Kullback-Leibler divergence D_KL(f, g) = E_f[I_g(X)] − E_f[I_f(X)] = ∫ f(x) ln (f(x)/g(x)) dx; and the mutual information MI(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint entropy of X and Y.
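The divergence definitions above can be sanity-checked numerically. A minimal sketch (not part of the talk; the two Gaussian parameter choices are arbitrary) comparing quadrature against the Gaussian closed form:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

f = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=2.0)

# D_KL(f, g) = ∫ f(x) ln(f(x)/g(x)) dx, by quadrature
kl_quad, _ = quad(lambda x: f.pdf(x) * np.log(f.pdf(x) / g.pdf(x)), -20, 20)

# closed form for Gaussians: ln(s_g/s_f) + (s_f^2 + (m_f - m_g)^2)/(2 s_g^2) - 1/2
kl_exact = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(kl_quad, kl_exact)  # both ≈ 0.4431
```

Note that D_KL is asymmetric: swapping f and g gives a different value.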
- 5. Applications of the KL divergence
- 7. Example: independent component analysis. Given observations X ∈ R^n, find a demixing matrix W ∈ R^{m×n} producing Y ∈ R^m: Y = WX. (1) Independence of the components of Y = WX is measured via the mutual information of WX, i.e., through the marginal entropies of f(w_j X), j = 1, …, m, so estimating W reduces to entropy estimation [Hyvärinen&Oja, 2000].
- 9. Example: k-means clustering minimizes L(c_1, …, c_K) = Σ_{i=1}^n min_{l=1,…,K} ‖x_i − c_l‖².
- 10. k-means loss L(c_1, …, c_K) = Σ_{i=1}^n min_{l=1,…,K} ‖x_i − c_l‖². [Figure: comparison of the nonparametric information theoretic clustering method (NIC) and k-means on three cases; (a)-(c) NIC, (d)-(f) k-means. From [Faivishevsky&Goldberger, 2010]]
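The k-means objective above is straightforward to evaluate; a small illustrative sketch (the data and the center placements are made up, not from the talk):

```python
import numpy as np

def kmeans_loss(X, centers):
    """L(c_1, ..., c_K) = sum_i min_l ||x_i - c_l||^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    return d2.min(axis=1).sum()

rng = np.random.default_rng(0)
# two well-separated Gaussian clusters in R^2
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
good = kmeans_loss(X, np.array([[0.0, 0.0], [5.0, 5.0]]))  # true cluster means
bad = kmeans_loss(X, np.array([[2.5, 2.5]]))               # one center between them
print(good, bad)  # the loss is much lower at the true cluster means
```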
- 11. Clustering via the conditional entropy H(X|Y)
- 12. Clustering via the conditional entropy H(X|Y). [Figure: clustering results on three cases, from [Faivishevsky&Goldberger, 2010]]
- 13. Dimensionality reduction by minimizing the conditional entropy H(X|Y); compare with Fisher discriminant analysis [Hino&Murata, 2010]
- 14. [Figure: two-dimensional projections found by LDA vs. conditional-entropy minimization (minH); axes: 1st axis / 2nd axis]
- 17. Example: change-point detection on the TOPIX index, with score(t) = log( f_after(t) / f_before(t) ). [Figure: TOPIX and ChangePointScore, 1988-02 to 1996-04; detected events include completion of the European single market, the Gulf war, the collapse of the bubble economy, and the Great Hanshin-Awaji Earthquake] [Murata+, 2013, Koshijima+, 2015]
- 21. Vapnik's principle: when solving a problem of interest (here, estimating an entropy), do not solve a more general problem (estimating the full density) as an intermediate step.
- 25. Outline: 1. density estimation, 2. entropy estimation, 3. experiments
- 26. Setting: given data D = {x_i}_{i=1}^n ⊂ R^1 drawn i.i.d. from an unknown density, estimate the density from D.
- 27. Running example: the two-component Gaussian mixture f(x) = (5/8) φ(x; μ = 0, σ = 1) + (3/8) φ(x; μ = 3, σ = 1)
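For reference, the running example can be sampled and its true differential entropy obtained by quadrature; a sketch (not from the talk; the sample size and integration limits are ad hoc choices):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def f(x):  # f(x) = 5/8 * N(0, 1) + 3/8 * N(3, 1)
    return 5 / 8 * norm.pdf(x, 0, 1) + 3 / 8 * norm.pdf(x, 3, 1)

# draw n i.i.d. samples by picking the component, then sampling within it
rng = np.random.default_rng(1)
n = 1000
comp = rng.random(n) < 5 / 8
x = np.where(comp, rng.normal(0, 1, n), rng.normal(3, 1, n))

# ground-truth differential entropy H(f) = -∫ f ln f dx, by quadrature
H_true, _ = quad(lambda t: -f(t) * np.log(f(t)), -10, 13)
print(H_true)
```

The quadrature value serves as the ground truth against which estimators can be checked.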
- 31. Kernel density estimation: f̂(x; h) = (1/(nh)) Σ_{i=1}^n κ((x − x_i)/h), (2) where the kernel κ satisfies ∫ κ(x) dx = 1 and h > 0 is the bandwidth. Writing κ_h(x) = h^{-1} κ(x/h), this is f̂(x; h) = (1/n) Σ_{i=1}^n κ_h(x − x_i).
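Formula (2) in code, with a Gaussian kernel; a minimal sketch (the bandwidth h = 0.4 is an arbitrary choice for illustration, not a recommendation):

```python
import numpy as np

def kde(x, data, h):
    """f_hat(x; h) = (1/(n h)) sum_i kappa((x - x_i)/h), Gaussian kappa."""
    z = (x[:, None] - data[None, :]) / h
    return (np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 500)
grid = np.linspace(-6, 6, 601)
est = kde(grid, data, h=0.4)
print(est.sum() * (grid[1] - grid[0]))  # ≈ 1: the estimate is itself a density
```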
- 32. Example: Gaussian kernel κ = N(0, 1), with varying bandwidth h
- 35. Pointwise accuracy at x, MSE (mean squared error): for an estimator θ̂, MSE(θ̂) = E[(θ̂ − θ)²] = Var[θ̂] + (E[θ̂] − θ)². Here E[f̂(x; h)] = E[κ_h(x − X)] = ∫ κ_h(x − y) f(y) dy; writing the convolution (f ∗ g)(x) = ∫ f(x − y) g(y) dy, the bias of f̂(x; h) is E[f̂(x; h)] − f(x) = (κ_h ∗ f)(x) − f(x), and the variance is Var[f̂(x; h)] = (1/n){(κ_h² ∗ f)(x) − (κ_h ∗ f)²(x)}.
- 36. Pointwise MSE at x: MSE[f̂(x; h)] = (1/n){(κ_h² ∗ f)(x) − (κ_h ∗ f)²(x)} + {(κ_h ∗ f)(x) − f(x)}².
- 37. Global (L²) accuracy: ISE (integrated squared error), ISE[f̂(·; h)] = ∫ (f̂(x; h) − f(x))² dx.
- 38. Since f̂(x; h) depends on the sample D = {x_i}_{i=1}^n, so does the ISE; averaging over D gives the MISE (mean integrated squared error): MISE[f̂(·; h)] = E_D[ISE[f̂(·; h, D)]] = ∫ E_D[(f̂(x; h, D) − f(x))²] dx = ∫ MSE[f̂(x; h, D)] dx.
- 39. Exact MISE: MISE[f̂(·; h)] = n^{-1} ∫ {(κ_h² ∗ f)(x) − (κ_h ∗ f)²(x)} dx + ∫ {(κ_h ∗ f)(x) − f(x)}² dx = (nh)^{-1} ∫ κ²(x) dx + (1 − n^{-1}) ∫ (κ_h ∗ f)²(x) dx − 2 ∫ (κ_h ∗ f)(x) f(x) dx + ∫ f(x)² dx.
- 41. Assumptions for the asymptotic analysis: (1) f is C² and f″ is square integrable (L²); (2) the bandwidth sequence {h_n} depends on n (h = h_n, with the dependence on n suppressed in the notation) and satisfies lim_{n→∞} h = 0 and lim_{n→∞} nh = ∞; (3) κ is a bounded symmetric probability density with ∫ κ(x) dx = 1, ∫ x κ(x) dx = 0, and μ₂(κ) = ∫ x² κ(x) dx < ∞.
- 42. Bias: E[f̂(x; h)] = ∫ κ(z) f(x − hz) dz. Expanding f(x − hz) = f(x) − hz f′(x) + (1/2) h² z² f″(x) + o(h²) and using the kernel moment conditions gives E[f̂(x; h)] = f(x) + (1/2) h² f″(x) ∫ z² κ(z) dz + o(h²), so E[f̂(x; h)] − f(x) = (1/2) h² μ₂(κ) f″(x) + o(h²). (3) Thus f̂ estimates a smoothed version of f, with bias proportional to the curvature f″.
- 43. Variance: writing R(g) = ∫ g²(x) dx for a function g, Var[f̂(x; h)] = (nh)^{-1} R(κ) f(x) + o((nh)^{-1}). (4) Combining the bias (3) and variance (4), MSE[f̂(x; h)] = (nh)^{-1} R(κ) f(x) + (1/4) h⁴ μ₂²(κ) (f″(x))² + o((nh)^{-1} + h⁴).
- 44. Integrating the MSE gives MISE[f̂(·; h)] = AMISE[f̂(·; h)] + o((nh)^{-1} + h⁴), with AMISE[f̂(·; h)] = (nh)^{-1} R(κ) + (1/4) h⁴ μ₂²(κ) R(f″). Minimizing the AMISE over h yields the asymptotically optimal bandwidth h_AMISE = [R(κ) / (μ₂²(κ) R(f″) n)]^{1/5}.
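Plugging a Gaussian kernel and a normal reference density into h_AMISE yields a familiar rule of thumb; a sketch (the normal-reference assumption used to evaluate R(f″) is a standard simplification, not part of the derivation above):

```python
import numpy as np

def h_amise_gaussian(sigma, n):
    """h_AMISE for a Gaussian kernel and reference density N(0, sigma^2):
    R(kappa) = 1/(2 sqrt(pi)), mu_2(kappa) = 1, R(f'') = 3/(8 sqrt(pi) sigma^5),
    giving h = (4/3)^(1/5) * sigma * n^(-1/5) (Silverman's rule of thumb)."""
    R_kappa = 1.0 / (2.0 * np.sqrt(np.pi))
    R_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma ** 5)
    return (R_kappa / (R_f2 * n)) ** 0.2

print(h_amise_gaussian(1.0, 100))  # ≈ 0.42
```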
- 46. k-nearest-neighbour density estimation: estimate f(z) at a point z ∈ R^p from D = {x_i}_{i=1}^n. Let ε_k be the distance from z to its k-th nearest neighbour in D, and let b(z; ε) = {x ∈ R^p : ‖z − x‖ < ε} be the ball of radius ε around z, with volume |b(z; ε)| = c_p ε^p, where c_p = π^{p/2} / Γ(p/2 + 1) and Γ(·) is the gamma function.
- 47. [Illustration: data points x_i ∈ D (◦) and a query point z ∈ R^p (×)]
- 53. Probability mass of the ε-ball around z: q_z(ε) = ∫_{b(z;ε)} f(x) dx. For ε = ε_k this mass is approximately k/n; more generally, writing k_ε for the number of sample points within distance ε of z, q_z(ε) ≈ k_ε/n.
- 54. Taylor expansion: q_z(ε_k) = ∫_{b(z;ε_k)} {f(z) + ∇f(z)ᵀ(x − z) + O(ε_k²)} dx = |b(z; ε_k)| (f(z) + O(ε_k²)) ≃ ε_k^p c_p f(z), since the linear term integrates to zero over the ball (c_p is the volume of the unit ball in R^p).
- 55. Equating k/n ≃ ε_k^p c_p f(z) and solving for f(z) gives the k-NN density estimator f̂_k(z) = (k / (c_p n)) ε_k^{-p}. (5)
- 56. k-NN density estimator: f̂_k(z) = (k / (c_p n)) ε_k^{-p}, (6) where ε_k is the distance from z to its k-th nearest neighbour in D.
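Estimator (6) in code; a brute-force sketch (a KD-tree would be used in practice, and k = 50 is an arbitrary choice for illustration):

```python
import numpy as np
from scipy.special import gammaln

def knn_density(z, data, k):
    """f_hat_k(z) = (k / (c_p n)) * eps_k^(-p), c_p = pi^(p/2) / Gamma(p/2 + 1),
    eps_k = distance from z to its k-th nearest neighbour in the data."""
    n, p = data.shape
    c_p = np.exp(0.5 * p * np.log(np.pi) - gammaln(0.5 * p + 1))
    eps_k = np.sort(np.linalg.norm(data - z, axis=1))[k - 1]
    return k / (c_p * n * eps_k ** p)

rng = np.random.default_rng(0)
data = rng.normal(0, 1, (2000, 2))
fhat = knn_density(np.zeros(2), data, k=50)
print(fhat)  # true f(0) for the 2-D standard normal is 1/(2*pi) ≈ 0.159
```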
- 58. Outline: 1. density estimation, 2. entropy estimation, 3. experiments
- 59. Entropy estimation: estimate H(f) from data D = {x_i}_{i=1}^n, x_i ∈ R^p, i = 1, …, n, drawn i.i.d. from the density f(x) of a random variable X.
- 61. Probability mass of the ε-ball around z: q_z(ε) = ∫_{b(z;ε)} f(x) dx. (7) Expanding, q_z(ε) = ∫_{b(z;ε)} {f(z) + (x − z)ᵀ ∇f(z) + O(ε²)} dx = |b(z; ε)| f(z) + O(ε^{p+2}) = c_p ε^p f(z) + O(ε^{p+2}); this mass is approximated by k/n, with a correction term of order O(ε^{p+2}).
- 63. A second-order expansion of q_z(ε) in ε gives q_z(ε) = c_p f(z) ε^p + (p / (4(p/2 + 1))) c_p ε^{p+2} tr∇²f(z) + O(ε^{p+4}). (8) Approximating q_z(ε) by k_ε/n and dividing by c_p ε^p, k_ε / (n c_p ε^p) = f(z) + C ε² + O(ε⁴), (9) where C = p tr∇²f(z) / (4(p/2 + 1)).
- 64. Set Y_ε = k_ε / (n c_p ε^p) and X_ε = ε². Ignoring the O(ε⁴) term, Y_ε is approximately linear in X_ε: Y_ε ≃ f(z) + C X_ε, (10) a simple linear regression with two unknowns, the intercept f(z) and the slope C.
- 65. Y_ε ≃ f(z) + C X_ε: compute the pair (X_ε, Y_ε) for several radii ε and fit the regression line; its intercept estimates f(z).
- 66. [Illustration: for each radius ε around z, count the sample points within distance ε (i.e., read off k_ε) and form the pair (X_ε, Y_ε)]
- 70. Fix a set of radii E = {ε_1, …, ε_m}, m < n, and compute {(X_ε, Y_ε)}_{ε∈E}. Minimize the least-squares objective R = (1/m) Σ_{ε∈E} (Y_ε − f(z) − C X_ε)² (11) over f(z) and C; the fitted intercept is the density estimate f̂_s(z).
- 71. Estimating f̂_s at each sample point in a leave-one-out fashion gives the entropy estimate Ĥ_s(D) = −(1/n) Σ_{i=1}^n ln f̂_{s,-i}(x_i), (12) where f̂_{s,-i}(x_i) is computed without using x_i. This is the Simple Regression Entropy Estimator (SRE) [Hino+, 2015].
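A compact sketch of the SRE pipeline (illustrative only: the radius grid, the clipping of non-positive intercepts, and the equal weighting of radii are ad hoc choices, not those of [Hino+, 2015]):

```python
import numpy as np
from scipy.special import gammaln

def sre_entropy(data, eps_grid):
    """SRE-style sketch: for each x_i, regress Y_eps = k_eps / ((n-1) c_p eps^p)
    on X_eps = eps^2 using leave-one-out counts; the intercept estimates
    f(x_i), and H_hat = -(1/n) sum_i ln f_hat(x_i)."""
    n, p = data.shape
    c_p = np.exp(0.5 * p * np.log(np.pi) - gammaln(0.5 * p + 1))
    X = np.column_stack([np.ones_like(eps_grid), eps_grid ** 2])
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # leave-one-out: ignore self-distance
    H = 0.0
    for i in range(n):
        k_eps = (d[i][None, :] < eps_grid[:, None]).sum(axis=1)
        Y = k_eps / ((n - 1) * c_p * eps_grid ** p)
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        H -= np.log(max(coef[0], 1e-12)) / n  # intercept = f_hat(x_i)
    return H

rng = np.random.default_rng(0)
H_hat = sre_entropy(rng.normal(0, 1, (500, 1)), np.linspace(0.2, 1.0, 10))
print(H_hat)  # true H for N(0, 1) ≈ 1.4189
```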
- 72. SRE: how it works. [Figure, Normal density: regression of Y_ε on ε² at z = 0.5; the fitted intercept gives f̂_s(z = 0.5)]
- 73. SRE: how it works. [Figure, Bimodal density: regression of Y_ε on ε² at z = 0.5; the fitted intercept gives f̂_s(z = 0.5)]
- 74. Towards a direct estimator: for each sample point x_i ∈ D, Y_ε ≃ f(x_i) + C X_ε with Y_ε = k_ε / (n c_p ε^p) and C = p tr∇²f(x_i) / (4(p/2 + 1)). Writing Y_ε^i and C^i for the quantities at x_i: Y_ε^i ≃ f(x_i) + C^i X_ε.
- 75. Averaging −ln Y_ε^i over the sample, with Y_ε^i = f(x_i) + C^i X_ε for each x_i ∈ D: −(1/n) Σ_{i=1}^n ln Y_ε^i = −(1/n) Σ_{i=1}^n ln( f(x_i) + C^i X_ε ) = −(1/n) Σ_{i=1}^n ln f(x_i)(1 + C^i X_ε / f(x_i)) = −(1/n) Σ_{i=1}^n ln f(x_i) − (1/n) Σ_{i=1}^n ln(1 + C^i X_ε / f(x_i)) ≃ −(1/n) Σ_{i=1}^n ln f(x_i) − (1/n) Σ_{i=1}^n (C^i / f(x_i)) X_ε.
- 76. Define Ȳ_ε = −(1/n) Σ_{i=1}^n ln Y_ε^i, H(D) = −(1/n) Σ_{i=1}^n ln f(x_i), and C̄ = −(1/n) Σ_{i=1}^n C^i / f(x_i). Then for each ε > 0, Ȳ_ε = H(D) + C̄ X_ε, (13) again a simple linear regression, now with the entropy itself as the intercept.
- 77. Fitting (13) over ε ∈ E by least squares, R_d = (1/m) Σ_{ε∈E} (Ȳ_ε − H(D) − C̄ X_ε)², the fitted intercept is the Direct Regression Entropy Estimator (DRE) [Hino+, 2015].
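A sketch of DRE (again illustrative; the k ≥ 1 guard for empty balls and the radius grid are ad hoc, and [Hino+, 2015] describes the actual procedure):

```python
import numpy as np
from scipy.special import gammaln

def dre_entropy(data, eps_grid):
    """DRE-style sketch: regress Ybar_eps = -(1/n) sum_i ln Y^i_eps on
    X_eps = eps^2; the fitted intercept directly estimates H(f)."""
    n, p = data.shape
    c_p = np.exp(0.5 * p * np.log(np.pi) - gammaln(0.5 * p + 1))
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # leave-one-out: ignore self-distance
    Ybar = []
    for e in eps_grid:
        k = np.maximum((d < e).sum(axis=1), 1)  # guard against empty balls
        Ybar.append(-np.mean(np.log(k / ((n - 1) * c_p * e ** p))))
    X = np.column_stack([np.ones_like(eps_grid), eps_grid ** 2])
    coef, *_ = np.linalg.lstsq(X, np.array(Ybar), rcond=None)
    return coef[0]  # intercept = H_hat(D)

rng = np.random.default_rng(0)
H_hat = dre_entropy(rng.normal(0, 1, (400, 1)), np.linspace(0.3, 1.0, 8))
print(H_hat)  # true H for N(0, 1) ≈ 1.4189
```

Unlike SRE, no per-point density estimate is formed; one regression yields the entropy directly.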
- 78. Recap: q_z(ε) = c_p f(z) ε^p + (p / (4(p/2 + 1))) c_p ε^{p+2} tr∇²f(z) + O(ε^{p+4}); approximating q_z(ε) by k_ε/n gives k_ε / (n c_p ε^p) = f(z) + C ε² + O(ε⁴), i.e., Y_ε = f(z) + C X_ε.
- 79. The two regression-based estimators side by side. SRE: min (1/m) Σ_{ε∈E} (Y_ε − f(z) − C X_ε)², and Ĥ_s(D) = −(1/n) Σ_{i=1}^n ln f̂_{-i}(x_i). DRE: min (1/m) Σ_{ε∈E} (Ȳ_ε − H(D) − C̄ X_ε)².
- 80. A third approach: regress the neighbour counts k_ε themselves.
- 81. From q_z(ε) = c_p f(z) ε^p + (p / (4(p/2 + 1))) c_p ε^{p+2} tr∇²f(z) + O(ε^{p+4}) and q_z(ε) ≈ k_ε/n, multiplying by n gives k_ε ≃ c_p n f(z) ε^p + c_p n (p / (4(p/2 + 1))) tr∇²f(z) ε^{p+2}.
- 82. Set X = (ε^p, ε^{p+2}) and Y = k_ε, and fit the linear model Y = βᵀX. Since k_ε is a count, it is natural to model its noise as Poisson.
- 83. Maximize the Poisson likelihood L(β) = Π_{i=1}^m e^{−X_iᵀβ} (X_iᵀβ)^{Y_i} / Y_i!. The fitted coefficient β̂₁ of ε^p gives the density estimate at z as β̂₁ / (c_p n); plugging these leave-one-out density estimates into the SRE-style average of −ln f̂ gives the LOO Entropy Estimator with Poisson-noise structure and Identity-link regression (EPI) [Hino+, under review].
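A toy version of the identity-link Poisson fit for a single query point (illustrative only: the optimizer, initialization, and radius grid are ad hoc choices; the actual EPI procedure is in [Hino+]):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def epi_density(z, data, eps_grid):
    """Model the neighbour counts k_eps as Poisson with identity link,
    mean = beta1 * eps^p + beta2 * eps^(p+2); then f_hat(z) = beta1 / (c_p n)."""
    n, p = data.shape
    c_p = np.exp(0.5 * p * np.log(np.pi) - gammaln(0.5 * p + 1))
    d = np.linalg.norm(data - z, axis=1)
    Y = np.array([(d < e).sum() for e in eps_grid], dtype=float)
    X = np.column_stack([eps_grid ** p, eps_grid ** (p + 2)])

    def nll(beta):  # negative Poisson log-likelihood, dropping ln(Y!) terms
        mu = np.clip(X @ beta, 1e-10, None)
        return np.sum(mu - Y * np.log(mu))

    def grad(beta):
        mu = np.clip(X @ beta, 1e-10, None)
        return X.T @ (1.0 - Y / mu)

    beta0 = np.array([Y[-1] / eps_grid[-1] ** p, 0.0])
    res = minimize(nll, beta0, jac=grad, method="L-BFGS-B")
    return res.x[0] / (c_p * n)

rng = np.random.default_rng(0)
data = rng.normal(0, 1, (2000, 1))
fhat = epi_density(np.zeros(1), data, np.linspace(0.1, 0.8, 10))
print(fhat)  # true f(0) = 1/sqrt(2*pi) ≈ 0.399
```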
- 84. Outline: 1. density estimation, 2. entropy estimation, 3. experiments
- 85. Experimental setup: compare the estimate Ĥ(D) with the true entropy H(f) via the absolute error AE = |H(f) − Ĥ(D)|, averaged over 100 repetitions.
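The evaluation protocol can be mimicked with any estimator; a sketch using a plain k-NN resubstitution estimate as the baseline (the estimator, k = 10, and n = 300 are arbitrary stand-ins, not the paper's configuration):

```python
import numpy as np

def knn_entropy_1d(x, k=10):
    """Resubstitution entropy estimate from a leave-one-out k-NN density
    (p = 1, c_1 = 2): H_hat = -(1/n) sum_i ln f_hat_k(x_i)."""
    n = len(x)
    d = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(d, np.inf)  # exclude the point itself
    eps_k = np.sort(d, axis=1)[:, k - 1]  # distance to k-th nearest neighbour
    return np.mean(np.log(2 * (n - 1) * eps_k / k))

rng = np.random.default_rng(0)
H_true = 0.5 * np.log(2 * np.pi * np.e)  # N(0, 1): ≈ 1.4189
ae = [abs(H_true - knn_entropy_1d(rng.normal(size=300))) for _ in range(100)]
print(np.mean(ae))  # mean absolute error over 100 repetitions
```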
- 86. Univariate case: 15 test distributions. [Figure: densities of Normal, Skewed, Strongly Skewed, Kurtotic, Bimodal, and Skewed Bimodal]
- 87. Univariate case: 15 test distributions (cont.). [Figure: densities of Trimodal, 10 Claw, Standard Power Exponential, Standard Logistic, Standard Classical Laplace, and t(df=5)]
- 88. Univariate case: 15 test distributions (cont.). [Figure: densities of Mixed t, Standard Exponential, and Cauchy]
- 89. [Results: estimation errors on Normal, Skewed, Strongly Skewed, Kurtotic, and Bimodal]
- 90. [Results: estimation errors on Skewed Bimodal, Trimodal, 10 Claw, Standard Power Exponential, and Standard Logistic]
- 91. [Results: estimation errors on Standard Classical Laplace, t(df=5), Mixed t, Standard Exponential, and Cauchy]
- 92. Univariate case results, curvature and improvement: the curvature term tr∇²f drives the bias of the plain k-NN estimator. Consider the Cauchy family with scale γ > 0: f(x; γ) = 1 / (πγ(1 + (x/γ)²)), with ∇²f(x; γ) = (2 / (πγ³)) (3(x/γ)² − 1) / (1 + (x/γ)²)³. Varying γ from 0.01 to 0.9 (n = 300, 100 repetitions), the improvement of the proposed estimators (e.g., EPI) over the plain k-NN estimator is measured by |Ĥ_k(D) − H(f)| − |Ĥ_s(D) − H(f)|.
- 93. Univariate case results, curvature and improvement. [Figure: Improvement plotted against LogMaxCurvature, max_{x∈R} log |∇²f(x; γ)|]
- 94. That's all, folks. Pros and cons of the KDE- and k-NN-based approaches.
- 95. References I. [Faivishevsky&Goldberger, 2010] Faivishevsky, L. and Goldberger, J. (2010). A nonparametric information theoretic clustering algorithm. ICML 2010. [Hino+, 2015] Hino, H., Koshijima, K., and Murata, N. (2015). Non-parametric entropy estimators based on simple linear regression. Computational Statistics & Data Analysis, 89:72-84. [Hino&Murata, 2010] Hino, H. and Murata, N. (2010). A conditional entropy minimization criterion for dimensionality reduction and multiple kernel learning. Neural Computation, 22(11):2887-2923. [Hyvärinen&Oja, 2000] Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411-430. [Koshijima+, 2015] Koshijima, K., Hino, H., and Murata, N. (2015). Change-point detection in a sequence of bags-of-data. IEEE Transactions on Knowledge and Data Engineering, 27(10):2632-2644.
- 96. References II. [Murata+, 2013] Murata, N., Koshijima, K., and Hino, H. (2013). Distance-based change-point detection with entropy estimation. In Proceedings of the Sixth Workshop on Information Theoretic Methods in Science and Engineering, pages 22-25.