Slides: Hypothesis testing, information divergence and computational geometry

Transcript

  • 1. Hypothesis testing, information divergence and computational geometry
    Frank Nielsen (Frank.Nielsen@acm.org, www.informationgeometry.org)
    Sony Computer Science Laboratories, Inc.
    August 2013, GSI, Paris, FR
  • 2. The Multiple Hypothesis Testing (MHT) problem
    Given a random variable $X$ with $n$ hypotheses $H_1: X \sim P_1, \dots, H_n: X \sim P_n$, decide from an i.i.d. sample $x_1, \dots, x_m \sim X$ which hypothesis holds true.
    $P_{\mathrm{correct}}^{(m)} = 1 - P_{\mathrm{error}}^{(m)}$
    Asymptotic regime: error exponent $\alpha = -\frac{1}{m}\log P_e^{(m)}$ as $m \to \infty$.
  • 3. Bayesian hypothesis testing (preliminaries)
    Prior probabilities: $w_i = \Pr(X \sim P_i) > 0$, with $\sum_{i=1}^n w_i = 1$.
    Conditional probabilities: $\Pr(X = x \mid X \sim P_i)$.
    Marginal: $\Pr(X = x) = \sum_{i=1}^n \Pr(X \sim P_i)\Pr(X = x \mid X \sim P_i) = \sum_{i=1}^n w_i \Pr(x \mid P_i)$.
    Let $c_{i,j}$ denote the cost of deciding $H_i$ when in fact $H_j$ is true; the matrix $[c_{i,j}]$ is the cost design matrix.
    Let $p_{i,j}(u)$ denote the probability of making this decision using rule $u$.
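This cost formalism amounts to the Bayes risk $\sum_x \sum_j w_j \Pr(x \mid P_j)\, c_{r(x),j}$ of a decision rule $r$. Below is a minimal Python sketch on a finite sample space; the names `expected_cost`, `likelihoods`, `priors`, `cost` and `rule` are mine, not from the slides.

```python
import numpy as np

# A minimal sketch of the cost formalism on a finite sample space.
# likelihoods[j, x] = Pr(X = x | X ~ P_j), priors[j] = w_j, cost[i, j] = c_{i,j},
# rule(x) = index of the hypothesis decided on observing x.
def expected_cost(likelihoods, priors, cost, rule):
    n, num_x = likelihoods.shape
    total = 0.0
    for x in range(num_x):
        i = rule(x)  # decision H_i made on observing x
        # average the cost over the (random) true hypothesis H_j and over x
        total += sum(priors[j] * likelihoods[j, x] * cost[i, j] for j in range(n))
    return total

# With the 0/1 cost matrix (c_{i,i} = 0, c_{i,j} = 1 for i != j), the expected
# cost is exactly the probability of error P_e of the rule.
cost01 = 1.0 - np.eye(3)
```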
  • 4. Bayesian detector
    Minimize the expected cost $E_X[c(r(x))]$, with $c(r(x)) = \sum_i w_i \sum_{j \neq i} c_{i,j}\, p_{i,j}(r(x))$.
    Special case: the probability of error $P_e$ is obtained for $c_{i,i} = 0$ and $c_{i,j} = 1$ for $i \neq j$: $P_e = E_X\big[\sum_i w_i \sum_{j \neq i} p_{i,j}(r(x))\big]$.
    The maximum a posteriori probability (MAP) rule classifies $x$ as $\mathrm{MAP}(x) = \mathrm{argmax}_{i \in \{1,\dots,n\}} w_i p_i(x)$, where $p_i(x) = \Pr(X = x \mid X \sim P_i)$ are the conditional probabilities.
    → The MAP Bayesian detector minimizes $P_e$ over all rules [8].
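As a concrete illustration of the MAP detector, the sketch below classifies samples drawn from two unit-variance Gaussian hypotheses with unequal priors and estimates its probability of error $P_e$ by Monte Carlo; the parameter values are hypothetical and chosen only for the example.

```python
import numpy as np

# MAP rule for two univariate Gaussian hypotheses H1: N(0,1), H2: N(2,1) with
# priors w = (0.7, 0.3), plus a Monte Carlo estimate of its error probability.
rng = np.random.default_rng(0)
w = np.array([0.7, 0.3])
means = np.array([0.0, 2.0])          # both hypotheses have unit variance

def log_pdf(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

def map_rule(x):
    # argmax_i w_i p_i(x), computed in log-space for numerical stability
    scores = np.stack([np.log(wi) + log_pdf(x, mu) for wi, mu in zip(w, means)])
    return np.argmax(scores, axis=0)

m = 200_000
labels = rng.choice(2, size=m, p=w)            # sample the true hypothesis
xs = rng.normal(loc=means[labels], scale=1.0)  # then the observation
pe_hat = np.mean(map_rule(xs) != labels)
print(f"estimated P_e of the MAP detector: {pe_hat:.4f}")
```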
  • 5. Probability of error and divergences
    Without loss of generality, consider equal priors $w_1 = w_2 = \frac{1}{2}$:
    $P_e = \int_{x \in \mathcal{X}} p(x)\,\min(\Pr(H_1 \mid x), \Pr(H_2 \mid x))\,\mathrm{d}\nu(x)$
    ($P_e > 0$ as soon as $\mathrm{supp}\,p_1 \cap \mathrm{supp}\,p_2 \neq \emptyset$).
    From Bayes' rule, $\Pr(H_i \mid X = x) = \frac{\Pr(H_i)\Pr(X = x \mid H_i)}{\Pr(X = x)} = w_i p_i(x)/p(x)$, hence
    $P_e = \frac{1}{2}\int_{x \in \mathcal{X}} \min(p_1(x), p_2(x))\,\mathrm{d}\nu(x)$.
    Rewrite or bound $P_e$ using tricks of the trade:
    Trick 1. $\forall a, b \in \mathbb{R}$: $\min(a, b) = \frac{a+b}{2} - \frac{|a-b|}{2}$.
    Trick 2. $\forall a, b > 0$: $\min(a, b) \le \min_{\alpha \in (0,1)} a^{\alpha} b^{1-\alpha}$.
  • 6. Probability of error and total variation
    $P_e = \frac{1}{2}\int_{x \in \mathcal{X}} \Big( \frac{p_1(x)+p_2(x)}{2} - \frac{|p_1(x)-p_2(x)|}{2} \Big)\,\mathrm{d}\nu(x) = \frac{1}{2}\Big(1 - \frac{1}{2}\int_{x \in \mathcal{X}} |p_1(x)-p_2(x)|\,\mathrm{d}\nu(x)\Big)$
    $P_e = \frac{1}{2}\big(1 - \mathrm{TV}(P_1, P_2)\big)$
    Total variation metric distance: $\mathrm{TV}(P, Q) = \frac{1}{2}\int_{x \in \mathcal{X}} |p(x) - q(x)|\,\mathrm{d}\nu(x)$.
    → Difficult to compute when handling multivariate distributions.
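A minimal sketch of the identity $P_e = \frac{1}{2}(1 - \mathrm{TV}(P_1, P_2))$ for equal priors, computing the total variation by one-dimensional numerical integration; the two Gaussian densities are hypothetical stand-ins (in higher dimensions this integral is exactly what becomes hard).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# P_e = (1 - TV(P1, P2)) / 2 for equal priors, with TV computed by 1D quadrature.
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.5)

# Both densities are negligible outside [-30, 30], so a finite range suffices.
tv, _ = quad(lambda x: 0.5 * abs(p1.pdf(x) - p2.pdf(x)), -30.0, 30.0, limit=200)
pe = 0.5 * (1.0 - tv)
print(f"TV = {tv:.4f}, P_e = {pe:.4f}")
```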
  • 7. Bounding the probability of error $P_e$
    Since $\min(a, b) \le \min_{\alpha \in (0,1)} a^{\alpha} b^{1-\alpha}$ for $a, b > 0$, upper-bound $P_e$:
    $P_e = \frac{1}{2}\int_{x \in \mathcal{X}} \min(p_1(x), p_2(x))\,\mathrm{d}\nu(x) \le \frac{1}{2}\min_{\alpha \in (0,1)} \int_{x \in \mathcal{X}} p_1^{\alpha}(x)\,p_2^{1-\alpha}(x)\,\mathrm{d}\nu(x)$.
    Chernoff information: $C(P_1, P_2) = -\log \min_{\alpha \in (0,1)} \int_{x \in \mathcal{X}} p_1^{\alpha}(x)\,p_2^{1-\alpha}(x)\,\mathrm{d}\nu(x) \ge 0$.
    With the best error exponent $\alpha^*$ [7]: $P_e \le w_1^{\alpha^*} w_2^{1-\alpha^*} e^{-C(P_1,P_2)} \le e^{-C(P_1,P_2)}$.
    The bounding technique can be extended using any quasi-arithmetic α-means [13, 9].
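The Chernoff information can be estimated numerically by nesting the integral inside a one-dimensional bounded minimization over $\alpha$. The sketch below does this for two hypothetical unit-variance Gaussians; `bhattacharyya_like_coeff` is my name for the integrand $\int p_1^{\alpha} p_2^{1-\alpha}\,\mathrm{d}\nu$ seen as a function of $\alpha$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# C(P1, P2) = -log min_{alpha in (0,1)} \int p1^alpha p2^(1-alpha) dnu,
# via 1D quadrature nested inside a bounded 1D minimization over alpha.
p1, p2 = norm(0.0, 1.0), norm(3.0, 1.0)

def bhattacharyya_like_coeff(alpha):
    val, _ = quad(lambda x: p1.pdf(x) ** alpha * p2.pdf(x) ** (1 - alpha), -30.0, 30.0)
    return val

res = minimize_scalar(bhattacharyya_like_coeff, bounds=(1e-6, 1 - 1e-6), method="bounded")
C = -np.log(res.fun)
print(f"alpha* ~ {res.x:.3f}, Chernoff information C ~ {C:.4f}")
# For equal priors, P_e <= exp(-C).
```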
  • 8. Computational information geometry
    Exponential family manifold [4]: $M = \{p_\theta \mid p_\theta(x) = \exp(t(x)^\top \theta - F(\theta))\}$.
    Dually flat manifolds [1] enjoy dual affine connections: $(M, \nabla^2 F(\theta), \nabla^{(e)}, \nabla^{(m)})$.
    Dual coordinate systems: $\eta = \nabla F(\theta)$, $\theta = \nabla F^*(\eta)$, with $F(\theta) + F^*(\eta) = \theta^\top \eta$.
    Canonical divergence from the Young inequality: $A(\theta_1, \eta_2) = F(\theta_1) + F^*(\eta_2) - \theta_1^\top \eta_2 \ge 0$.
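A minimal sketch of this Legendre duality on a concrete exponential family, the Bernoulli family in natural coordinates, where $F(\theta) = \log(1 + e^{\theta})$, $\eta = \nabla F(\theta)$ is the sigmoid, and $F^*(\eta) = \eta\log\eta + (1-\eta)\log(1-\eta)$; the function names are mine.

```python
import numpy as np

# Legendre duality for the Bernoulli exponential family in natural coordinates.
def F(theta):
    return np.log1p(np.exp(theta))            # log-normalizer

def grad_F(theta):
    return 1.0 / (1.0 + np.exp(-theta))       # eta = E[t(X)] (sigmoid)

def F_star(eta):
    return eta * np.log(eta) + (1 - eta) * np.log(1 - eta)   # convex conjugate

def canonical_divergence(theta1, eta2):
    # A(theta1, eta2) = F(theta1) + F*(eta2) - theta1 * eta2 >= 0 (Young inequality)
    return F(theta1) + F_star(eta2) - theta1 * eta2

theta = 0.8
eta = grad_F(theta)
print(F(theta) + F_star(eta) - theta * eta)       # ~ 0: equality for dual pairs
print(canonical_divergence(theta, grad_F(-1.2)))  # > 0 for mismatched coordinates
```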
  • 9. MAP decision rule and additive Bregman Voronoi diagrams
    $\mathrm{KL}(p_{\theta_1} : p_{\theta_2}) = B(\theta_2 : \theta_1) = A(\theta_2 : \eta_1) = A^*(\eta_1 : \theta_2) = B^*(\eta_1 : \eta_2)$
    Canonical divergence (mixed primal/dual coordinates): $A(\theta_2 : \eta_1) = F(\theta_2) + F^*(\eta_1) - \theta_2^\top \eta_1 \ge 0$.
    Bregman divergence (uni-coordinates, primal or dual): $B(\theta_2 : \theta_1) = F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)^\top \nabla F(\theta_1)$.
    $\log p_i(x) = -B^*(t(x) : \eta_i) + F^*(t(x)) + k(x)$, with $\eta_i = \nabla F(\theta_i) = \eta(P_{\theta_i})$.
    Optimal MAP decision rule:
    $\mathrm{MAP}(x) = \mathrm{argmax}_{i \in \{1,\dots,n\}} w_i p_i(x) = \mathrm{argmax}_{i \in \{1,\dots,n\}} \{-B^*(t(x) : \eta_i) + \log w_i\} = \mathrm{argmin}_{i \in \{1,\dots,n\}} \{B^*(t(x) : \eta_i) - \log w_i\}$
    → nearest neighbor classifier [2, 10, 15, 16]
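A sketch of the MAP rule as an additively weighted nearest-neighbor rule in the dual coordinates, $\mathrm{MAP}(x) = \mathrm{argmin}_i\, B^*(t(x) : \eta_i) - \log w_i$, instantiated (my choice, not the slides') for univariate Gaussians with known unit variance, where $t(x) = x$, $\eta_i = \mu_i$, $F^*(\eta) = \eta^2/2$ and hence $B^*(t(x) : \eta_i) = (x - \mu_i)^2/2$.

```python
import numpy as np

# MAP as an additively weighted NN rule in the expectation (dual) coordinates,
# for univariate Gaussians with known variance sigma^2 = 1: F*(eta) = eta^2 / 2.
def F_star(eta):
    return 0.5 * eta ** 2

def grad_F_star(eta):
    return eta

def bregman_dual(a, b):
    # B*(a : b) = F*(a) - F*(b) - (a - b) * grad F*(b)
    return F_star(a) - F_star(b) - (a - b) * grad_F_star(b)

def map_classify(x, etas, weights):
    scores = [bregman_dual(x, eta) - np.log(w) for eta, w in zip(etas, weights)]
    return int(np.argmin(scores))

etas = [0.0, 2.0, 5.0]        # expectation parameters (here: the means)
weights = [0.5, 0.3, 0.2]     # priors w_i
print(map_classify(1.2, etas, weights))   # index of the winning hypothesis
```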
  • 10. MAP & nearest neighbor classifier
    Bregman Voronoi diagrams (with additive weights) are affine diagrams [2].
    $\mathrm{argmin}_{i \in \{1,\dots,n\}}\, B^*(t(x) : \eta_i) - \log w_i$ can be answered by:
    ◮ point location in an arrangement [3] (small dimensions),
    ◮ divergence-based search trees [16],
    ◮ GPU brute force [6].
  • 11. Geometry of the best error exponent: binary hypothesis
    On the exponential family manifold, the Chernoff α-coefficient [5] is
    $c_\alpha(P_{\theta_1} : P_{\theta_2}) = \int p_{\theta_1}^{\alpha}(x)\,p_{\theta_2}^{1-\alpha}(x)\,\mathrm{d}\mu(x) = \exp(-J_F^{(\alpha)}(\theta_1 : \theta_2))$,
    where $J_F^{(\alpha)}$ is the skew Jensen divergence [14] on the natural parameters:
    $J_F^{(\alpha)}(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F(\theta_{12}^{(\alpha)})$, with $\theta_{12}^{(\alpha)} = \alpha\theta_1 + (1-\alpha)\theta_2$.
    Chernoff information = Bregman divergence for exponential families:
    $C(P_{\theta_1} : P_{\theta_2}) = B(\theta_1 : \theta_{12}^{(\alpha^*)}) = B(\theta_2 : \theta_{12}^{(\alpha^*)})$
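A quick numerical check (my own, with hypothetical parameters) that the Chernoff α-coefficient equals $\exp(-J_F^{(\alpha)})$ for unit-variance univariate Gaussians, for which the natural parameter is $\theta = \mu$ and $F(\theta) = \theta^2/2$ up to the carrier measure.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Check: \int p1^alpha p2^(1-alpha) = exp(-J_F^(alpha)) for N(mu, 1) hypotheses,
# with theta = mu and F(theta) = theta^2 / 2 (the carrier term cancels since
# alpha + (1 - alpha) = 1).
def F(theta):
    return 0.5 * theta ** 2

def skew_jensen(alpha, theta1, theta2):
    return alpha * F(theta1) + (1 - alpha) * F(theta2) - F(alpha * theta1 + (1 - alpha) * theta2)

theta1, theta2, alpha = 0.0, 3.0, 0.3
p1, p2 = norm(theta1, 1.0), norm(theta2, 1.0)
coeff, _ = quad(lambda x: p1.pdf(x) ** alpha * p2.pdf(x) ** (1 - alpha), -30.0, 30.0)
print(coeff, np.exp(-skew_jensen(alpha, theta1, theta2)))  # the two values agree
```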
  • 12. Geometry of the best error exponent: binary hypothesis
    Chernoff distribution $P^*$ [12]: $P^* = P_{\theta_{12}^{(\alpha^*)}} = G_e(P_1, P_2) \cap \mathrm{Bi}_m(P_1, P_2)$
    e-geodesic: $G_e(P_1, P_2) = \{E_{12}^{(\lambda)} \mid \theta(E_{12}^{(\lambda)}) = (1-\lambda)\theta_1 + \lambda\theta_2,\ \lambda \in [0,1]\}$
    m-bisector: $\mathrm{Bi}_m(P_1, P_2) : \{P \mid F(\theta_1) - F(\theta_2) + \eta(P)^\top \Delta\theta = 0\}$, with $\Delta\theta = \theta_2 - \theta_1$.
    Optimal natural parameter of $P^*$:
    $\theta^* = \theta_{12}^{(\alpha^*)} = \mathrm{argmin}_{\theta : P_\theta \in \mathrm{Bi}_m(P_1,P_2)} B(\theta_1 : \theta) = \mathrm{argmin}_{\theta : P_\theta \in \mathrm{Bi}_m(P_1,P_2)} B(\theta_2 : \theta)$
    → closed form for order-1 families, otherwise an efficient bisection search.
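A sketch of the bisection search mentioned on the slide: along the e-geodesic $\theta_\alpha = \alpha\theta_1 + (1-\alpha)\theta_2$, the m-bisector condition is equivalent to $B(\theta_1 : \theta_\alpha) = B(\theta_2 : \theta_\alpha)$, so one can bisect on the sign of their difference. The function names are mine; the example instantiates the Poisson family, $F(\theta) = e^{\theta}$, $\theta = \log(\text{rate})$.

```python
import numpy as np

# Bisection along theta_alpha = alpha*theta1 + (1-alpha)*theta2 for the root of
# g(alpha) = B(theta1 : theta_alpha) - B(theta2 : theta_alpha), i.e. the point
# where the e-geodesic crosses the m-bisector (the Chernoff distribution P*).
def bregman(F, grad_F, ta, tb):
    return F(ta) - F(tb) - (ta - tb) * grad_F(tb)

def chernoff_point(F, grad_F, theta1, theta2, tol=1e-10):
    lo, hi = 0.0, 1.0                  # g(0+) > 0 > g(1-)
    while hi - lo > tol:
        alpha = 0.5 * (lo + hi)
        t = alpha * theta1 + (1 - alpha) * theta2
        g = bregman(F, grad_F, theta1, t) - bregman(F, grad_F, theta2, t)
        lo, hi = (alpha, hi) if g > 0 else (lo, alpha)
    # At alpha*, C(P_theta1 : P_theta2) = B(theta1 : theta_alpha*)
    return alpha, bregman(F, grad_F, theta1, t)

# Example: Poisson family in natural coordinates, F(theta) = exp(theta).
alpha_star, C = chernoff_point(np.exp, np.exp, np.log(1.0), np.log(5.0))
print(f"alpha* ~ {alpha_star:.4f}, C ~ {C:.4f}")
```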
  • 13. Geometry of the best error exponent: binary hypothesis
    [Figure, in the η-coordinate system: the Chernoff distribution $P_{\theta_{12}^{(\alpha^*)}}$ is the intersection of the e-geodesic $G_e(P_{\theta_1}, P_{\theta_2})$ with the m-bisector $\mathrm{Bi}_m(P_{\theta_1}, P_{\theta_2})$; $C(\theta_1 : \theta_2) = B(\theta_1 : \theta_{12}^{(\alpha^*)})$.]
  • 14. Geometry of the best error exponent: multiple hypothesis
    n-ary MHT [8] from the minimum pairwise Chernoff distance:
    $C(P_1, \dots, P_n) = \min_{i, j \neq i} C(P_i, P_j)$
    $P_e^{(m)} \le e^{-m\,C(P_{i^*}, P_{j^*})}$, with $(i^*, j^*) = \mathrm{argmin}_{i, j \neq i} C(P_i, P_j)$.
    Compute, for each pair of natural neighbors [3] $P_{\theta_i}$ and $P_{\theta_j}$, the Chernoff distance $C(P_{\theta_i}, P_{\theta_j})$, and choose the pair with minimal distance.
    (Proof by contradiction using the Bregman Pythagorean theorem.)
    → Closest Bregman pair problem (the Chernoff distance fails the triangle inequality).
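A brute-force sketch of the closest-pair step (over all pairs rather than only natural neighbors, which is fine for small $n$), again on the Poisson family, where $C(P_i, P_j)$ is obtained by maximizing the skew Jensen divergence over $\alpha$; all names and rate values are mine.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import minimize_scalar

# Minimum pairwise Chernoff distance over n Poisson hypotheses,
# F(theta) = exp(theta): C(P_i, P_j) = max_alpha J_F^(alpha)(theta_i : theta_j).
def chernoff_info(theta_i, theta_j):
    def neg_skew_jensen(a):
        return -(a * np.exp(theta_i) + (1 - a) * np.exp(theta_j)
                 - np.exp(a * theta_i + (1 - a) * theta_j))
    res = minimize_scalar(neg_skew_jensen, bounds=(1e-9, 1 - 1e-9), method="bounded")
    return -res.fun

thetas = np.log([1.0, 2.5, 6.0, 9.0])    # natural parameters of the n hypotheses
i_star, j_star = min(combinations(range(len(thetas)), 2),
                     key=lambda ij: chernoff_info(thetas[ij[0]], thetas[ij[1]]))
C_min = chernoff_info(thetas[i_star], thetas[j_star])
print(f"closest pair: ({i_star}, {j_star}), C = {C_min:.4f}")
# The m-sample error decays as P_e^(m) <= exp(-m * C_min).
```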
  • 15. Hypothesis testing: illustration
    [Figure, in the η-coordinate system: Chernoff distributions between natural neighbours.]
  • 16. Summary
    Bayesian multiple hypothesis testing, from the viewpoint of computational geometry:
    ◮ probability of error & the best MAP Bayesian rule,
    ◮ total variation & $P_e$, upper-bounded using the Chernoff distance.
    On exponential family manifolds:
    ◮ MAP rule = NN classifier (additive Bregman Voronoi diagram),
    ◮ best error exponent from the geodesic/bisector intersection for binary hypotheses,
    ◮ best error exponent from the closest Bregman pair for multiple hypotheses.
  • 17. Thank you. GSI, 28th–30th August 2013, Paris.
    @incollection{HTIGCG-GSI-2013,
      year      = {2013},
      booktitle = {Geometric Science of Information},
      volume    = {8085},
      series    = {Lecture Notes in Computer Science},
      editor    = {Frank Nielsen and Fr\'ed\'eric Barbaresco},
      title     = {Hypothesis testing, information divergence and computational geometry},
      publisher = {Springer Berlin Heidelberg},
      author    = {Nielsen, Frank},
      pages     = {241--248}
    }
  • 18. Bibliographic references I
    [1] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
    [2] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, 2010.
    [3] Jean-Daniel Boissonnat and Mariette Yvinec. Algorithmic Geometry. Cambridge University Press, New York, NY, USA, 1998.
    [4] Lawrence D. Brown. Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Institute of Mathematical Statistics, Hayworth, CA, USA, 1986.
    [5] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.
    [6] Vincent Garcia, Eric Debreuve, Frank Nielsen, and Michel Barlaud. k-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In IEEE International Conference on Image Processing (ICIP), pages 3757–3760, 2010.
  • 19. Bibliographic references II
    [7] Martin E. Hellman and Josef Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Transactions on Information Theory, 16:368–372, 1970.
    [8] C. C. Leang and D. H. Johnson. On the asymptotics of M-hypothesis Bayesian detection. IEEE Transactions on Information Theory, 43(1):280–282, January 1997.
    [9] Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Submitted, 2012.
    [10] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012. Preliminary technical report on arXiv.
    [11] Frank Nielsen. Hypothesis testing, information divergence and computational geometry. In Frank Nielsen and Frédéric Barbaresco, editors, Geometric Science of Information, volume 8085 of Lecture Notes in Computer Science, pages 241–248. Springer Berlin Heidelberg, 2013.
  • 20. Bibliographic references III
    [12] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters (SPL), 20(3):269–272, March 2013.
    [13] Frank Nielsen. Pattern learning and recognition on statistical manifolds: An information-geometric review. In Edwin Hancock and Marcello Pelillo, editors, Similarity-Based Pattern Recognition, volume 7953 of Lecture Notes in Computer Science, pages 1–25. Springer Berlin Heidelberg, 2013.
    [14] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
    [15] Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878–881, 2009.
    [16] Paolo Piro, Frank Nielsen, and Michel Barlaud. Tailored Bregman ball trees for effective nearest neighbors. In European Workshop on Computational Geometry (EuroCG), LORIA, Nancy, France, March 2009. IEEE.