# Slides: Hypothesis testing, information divergence and computational geometry


### Transcript

• 1. Hypothesis testing, information divergence and computational geometry. Frank Nielsen (Frank.Nielsen@acm.org, www.informationgeometry.org), Sony Computer Science Laboratories, Inc. August 2013, GSI, Paris, FR. © 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.
• 2. The Multiple Hypothesis Testing (MHT) problem. Given a random variable $X$ with $n$ hypotheses $H_1 : X \sim P_1, \ldots, H_n : X \sim P_n$, decide from an i.i.d. sample $x_1, \ldots, x_m \sim X$ which hypothesis holds true. $P_{\mathrm{correct}}^{(m)} = 1 - P_{\mathrm{error}}^{(m)}$. Asymptotic regime (error exponent): $\alpha = -\lim_{m \to \infty} \frac{1}{m} \log P_e^{(m)}$.
• 3. Bayesian hypothesis testing (preliminaries). Prior probabilities: $w_i = \Pr(X \sim P_i) > 0$ (with $\sum_{i=1}^n w_i = 1$); conditional probabilities: $\Pr(X = x \mid X \sim P_i)$. Marginal: $\Pr(X = x) = \sum_{i=1}^n \Pr(X \sim P_i) \Pr(X = x \mid X \sim P_i) = \sum_{i=1}^n w_i \Pr(x \mid P_i)$. Let $c_{i,j}$ be the cost of deciding $H_i$ when in fact $H_j$ is true; the matrix $[c_{i,j}]$ is the cost design matrix. Let $p_{i,j}(u)$ be the probability of making this decision using rule $u$.
• 4. Bayesian detector. Minimize the expected cost $E_X[c(r(x))]$, with $c(r(x)) = \sum_i w_i \sum_{j \neq i} c_{i,j} p_{i,j}(r(x))$. Special case: the probability of error $P_e$ is obtained for $c_{i,i} = 0$ and $c_{i,j} = 1$ for $i \neq j$: $P_e = E_X\left[\sum_i w_i \sum_{j \neq i} p_{i,j}(r(x))\right]$. The maximum a posteriori probability (MAP) rule classifies $x$ as $\mathrm{MAP}(x) = \mathrm{argmax}_{i \in \{1,\ldots,n\}} w_i p_i(x)$, where $p_i(x) = \Pr(X = x \mid X \sim P_i)$ are the conditional probabilities. → The MAP Bayesian detector minimizes $P_e$ over all rules [8].
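The MAP rule above can be sketched in a few lines. This is a minimal illustration with two hypothetical unit-variance Gaussian hypotheses of our own choosing (the means 0 and 4 and the helper names `map_decide`, `gauss_pdf` are not from the slides):

```python
import math

def map_decide(x, priors, densities):
    """MAP rule from the slide: argmax_i w_i * p_i(x)."""
    scores = [w * p(x) for w, p in zip(priors, densities)]
    return max(range(len(scores)), key=lambda i: scores[i])

def gauss_pdf(mu, sigma):
    # univariate Gaussian density p(x) = N(x; mu, sigma^2)
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

densities = [gauss_pdf(0.0, 1.0), gauss_pdf(4.0, 1.0)]
priors = [0.5, 0.5]
print(map_decide(0.3, priors, densities))  # 0 (sample near mu = 0)
print(map_decide(3.5, priors, densities))  # 1 (sample near mu = 4)
```

With equal priors the decision boundary sits halfway between the two means, so the two queries fall on opposite sides of it.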
• 5. Probability of error and divergences. Without loss of generality, consider equal priors ($w_1 = w_2 = \frac{1}{2}$): $P_e = \int_{x \in \mathcal{X}} p(x) \min(\Pr(H_1 \mid x), \Pr(H_2 \mid x)) \,\mathrm{d}\nu(x)$ ($P_e > 0$ as soon as $\mathrm{supp}\, p_1 \cap \mathrm{supp}\, p_2 \neq \emptyset$). From Bayes' rule, $\Pr(H_i \mid X = x) = \frac{\Pr(H_i) \Pr(X = x \mid H_i)}{\Pr(X = x)} = w_i p_i(x) / p(x)$, hence $P_e = \frac{1}{2} \int_{x \in \mathcal{X}} \min(p_1(x), p_2(x)) \,\mathrm{d}\nu(x)$. Rewrite or bound $P_e$ using tricks of the trade: Trick 1. $\forall a, b \in \mathbb{R}$, $\min(a, b) = \frac{a+b}{2} - \frac{|a-b|}{2}$. Trick 2. $\forall a, b > 0$, $\min(a, b) \leq \min_{\alpha \in (0,1)} a^{\alpha} b^{1-\alpha}$.
• 6. Probability of error and total variation. Applying Trick 1: $P_e = \frac{1}{2} \int_{x \in \mathcal{X}} \left( \frac{p_1(x) + p_2(x)}{2} - \frac{|p_1(x) - p_2(x)|}{2} \right) \mathrm{d}\nu(x) = \frac{1}{2} \left( 1 - \frac{1}{2} \int_{x \in \mathcal{X}} |p_1(x) - p_2(x)| \,\mathrm{d}\nu(x) \right)$, so that $P_e = \frac{1}{2}(1 - \mathrm{TV}(P_1, P_2))$, where the total variation metric distance is $\mathrm{TV}(P, Q) = \frac{1}{2} \int_{x \in \mathcal{X}} |p(x) - q(x)| \,\mathrm{d}\nu(x)$. → Difficult to compute when handling multivariate distributions.
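The identity $P_e = \frac{1}{2}(1 - \mathrm{TV}(P_1, P_2))$ is easy to check numerically. A small sketch on two made-up discrete distributions (the values are ours, chosen only for illustration):

```python
# Two made-up discrete distributions over a 3-element support.
p1 = [0.5, 0.3, 0.2]
p2 = [0.1, 0.3, 0.6]

# TV(P,Q) = (1/2) sum_x |p(x) - q(x)|   (the integral becomes a sum here)
tv = 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))
# P_e = (1/2) sum_x min(p1(x), p2(x))   (equal priors)
pe = 0.5 * sum(min(a, b) for a, b in zip(p1, p2))

print(tv)  # 0.4
print(pe)  # 0.3
print(abs(pe - 0.5 * (1 - tv)) < 1e-12)  # True: P_e = (1 - TV)/2
```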
• 7. Bounding the probability of error $P_e$. Using $\min(a, b) \leq \min_{\alpha \in (0,1)} a^{\alpha} b^{1-\alpha}$ for $a, b > 0$, upper-bound $P_e$: $P_e = \frac{1}{2} \int_{x \in \mathcal{X}} \min(p_1(x), p_2(x)) \,\mathrm{d}\nu(x) \leq \frac{1}{2} \min_{\alpha \in (0,1)} \int_{x \in \mathcal{X}} p_1^{\alpha}(x) p_2^{1-\alpha}(x) \,\mathrm{d}\nu(x)$. Chernoff information: $C(P_1, P_2) = -\log \min_{\alpha \in (0,1)} \int_{x \in \mathcal{X}} p_1^{\alpha}(x) p_2^{1-\alpha}(x) \,\mathrm{d}\nu(x) \geq 0$. Best error exponent $\alpha^*$ [7]: $P_e \leq w_1^{\alpha^*} w_2^{1-\alpha^*} e^{-C(P_1, P_2)} \leq e^{-C(P_1, P_2)}$. The bounding technique can be extended using any quasi-arithmetic $\alpha$-means [13, 9].
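For discrete distributions the Chernoff information can be approximated by brute force. A sketch (our own grid-search approach, not an algorithm from the slides; the distributions are the same made-up ones as above):

```python
import math

def bhattacharyya_alpha(p, q, alpha):
    """Chernoff alpha-coefficient: sum_x p(x)^alpha q(x)^(1-alpha)."""
    return sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p, q))

def chernoff_information(p, q, steps=10_000):
    """C(P,Q) = -log min_{alpha in (0,1)} c_alpha, via a plain grid search."""
    alphas = [(i + 1) / (steps + 1) for i in range(steps)]
    return -math.log(min(bhattacharyya_alpha(p, q, a) for a in alphas))

p1 = [0.5, 0.3, 0.2]
p2 = [0.1, 0.3, 0.6]
C = chernoff_information(p1, p2)
print(C > 0)  # True

# Sanity check of the slide's bound: P_e <= e^{-C} (equal priors).
pe = 0.5 * sum(min(a, b) for a, b in zip(p1, p2))
print(pe <= math.exp(-C))  # True
```

A grid search suffices here because the α-coefficient is a smooth convex-like function of α on (0, 1); the bisection search mentioned later is the more principled route for exponential families.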
• 8. Computational information geometry. Exponential family manifold [4]: $\mathcal{M} = \{p_\theta \mid p_\theta(x) = \exp(t(x)^\top \theta - F(\theta))\}$. Dually flat manifolds enjoy dual affine connections [1]: $(\mathcal{M}, \nabla^2 F(\theta), \nabla^{(e)}, \nabla^{(m)})$. Dual coordinate systems: $\eta = \nabla F(\theta)$, $\theta = \nabla F^*(\eta)$, with $F(\theta) + F^*(\eta) = \theta^\top \eta$ at dual pairs. Canonical divergence from the Young inequality: $A(\theta_1, \eta_2) = F(\theta_1) + F^*(\eta_2) - \theta_1^\top \eta_2 \geq 0$.
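The Legendre duality above can be verified concretely. A sketch on the Bernoulli exponential family (our choice of family, not the slide's): $F(\theta) = \log(1 + e^{\theta})$, $\eta = F'(\theta) = \sigma(\theta)$, and $F^*(\eta) = \eta \log \eta + (1 - \eta) \log(1 - \eta)$:

```python
import math

# Bernoulli exponential family: log-normalizer F, dual coordinate eta = F',
# and convex conjugate F* (negative binary entropy).
F = lambda t: math.log1p(math.exp(t))
grad_F = lambda t: 1.0 / (1.0 + math.exp(-t))
F_star = lambda e: e * math.log(e) + (1 - e) * math.log(1 - e)

theta = 1.25
eta = grad_F(theta)
# Young equality at a dual pair: F(theta) + F*(eta) = theta * eta
print(abs(F(theta) + F_star(eta) - theta * eta) < 1e-9)  # True

# Canonical divergence A(theta1, eta2) = F(theta1) + F*(eta2) - theta1*eta2 >= 0
theta1, theta2 = -0.5, 2.0
A = F(theta1) + F_star(grad_F(theta2)) - theta1 * grad_F(theta2)
print(A >= 0)  # True
```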
• 9. MAP decision rule and additive Bregman Voronoi diagrams. $\mathrm{KL}(p_{\theta_1} : p_{\theta_2}) = B(\theta_2 : \theta_1) = A(\theta_2 : \eta_1) = A^*(\eta_1 : \theta_2) = B^*(\eta_1 : \eta_2)$. Canonical divergence (mixed primal/dual coordinates): $A(\theta_2 : \eta_1) = F(\theta_2) + F^*(\eta_1) - \theta_2^\top \eta_1 \geq 0$. Bregman divergence (uni-coordinates, primal or dual): $B(\theta_2 : \theta_1) = F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)^\top \nabla F(\theta_1)$. Moreover, $\log p_i(x) = -B^*(t(x) : \eta_i) + F^*(t(x)) + k(x)$, with $\eta_i = \nabla F(\theta_i) = \eta(P_{\theta_i})$. Optimal MAP decision rule: $\mathrm{MAP}(x) = \mathrm{argmax}_{i \in \{1,\ldots,n\}} w_i p_i(x) = \mathrm{argmax}_{i} \left(-B^*(t(x) : \eta_i) + \log w_i\right) = \mathrm{argmin}_{i} \left(B^*(t(x) : \eta_i) - \log w_i\right)$ → a nearest neighbor classifier [2, 10, 15, 16].
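The Bregman nearest-neighbor form of MAP can be sketched on the simplest case: unit-variance univariate Gaussians, where $F^*(\eta) = \eta^2/2$, $t(x) = x$, and so $B^*(t(x) : \eta_i) = (x - \eta_i)^2 / 2$ (our toy instance; the function names are hypothetical):

```python
import math

def bregman_dual(eta1, eta2):
    """B*(eta1 : eta2) for F*(eta) = eta^2 / 2, i.e. squared Euclidean / 2."""
    F_star = lambda e: e * e / 2.0
    grad_F_star = lambda e: e
    return F_star(eta1) - F_star(eta2) - (eta1 - eta2) * grad_F_star(eta2)

def map_bregman(x, etas, priors):
    """MAP as the additively weighted NN query argmin_i B*(t(x):eta_i) - log w_i."""
    return min(range(len(etas)),
               key=lambda i: bregman_dual(x, etas[i]) - math.log(priors[i]))

etas = [0.0, 4.0]   # expectation parameters = the two Gaussian means
priors = [0.5, 0.5]
print(map_bregman(0.3, etas, priors))  # 0
print(map_bregman(3.5, etas, priors))  # 1
```

For this family the rule reduces to an additively weighted Euclidean nearest-neighbor query, which is exactly the "affine diagram" structure of the next slide.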
• 10. MAP & nearest neighbor classifier. Bregman Voronoi diagrams (with additive weights) are affine diagrams [2]. The query $\mathrm{argmin}_{i \in \{1,\ldots,n\}} B^*(t(x) : \eta_i) - \log w_i$ can be answered by: point location in an arrangement [3] (small dimensions), divergence-based search trees [16], or GPU brute force [6].
• 11. Geometry of the best error exponent: binary hypothesis. On the exponential family manifold, the Chernoff $\alpha$-coefficient [5] is $c_\alpha(P_{\theta_1} : P_{\theta_2}) = \int p_{\theta_1}^{\alpha}(x) p_{\theta_2}^{1-\alpha}(x) \,\mathrm{d}\mu(x) = \exp(-J_F^{(\alpha)}(\theta_1 : \theta_2))$, where the skew Jensen divergence [14] on the natural parameters is $J_F^{(\alpha)}(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F(\theta_{12}^{(\alpha)})$, with the interpolated parameter $\theta_{12}^{(\alpha)} = \alpha \theta_1 + (1-\alpha) \theta_2$. Chernoff information = Bregman divergence for exponential families: $C(P_{\theta_1} : P_{\theta_2}) = B(\theta_1 : \theta_{12}^{(\alpha^*)}) = B(\theta_2 : \theta_{12}^{(\alpha^*)})$.
• 12. Geometry of the best error exponent: binary hypothesis. Chernoff distribution $P^*$ [12]: $P^* = P_{\theta_{12}^*} = G_e(P_1, P_2) \cap \mathrm{Bi}_m(P_1, P_2)$. e-geodesic: $G_e(P_1, P_2) = \{E_{12}^{(\lambda)} \mid \theta(E_{12}^{(\lambda)}) = (1-\lambda)\theta_1 + \lambda\theta_2, \lambda \in [0, 1]\}$; m-bisector: $\mathrm{Bi}_m(P_1, P_2) = \{P \mid F(\theta_1) - F(\theta_2) + \eta(P)^\top \Delta\theta = 0\}$. Optimal natural parameter of $P^*$: $\theta^* = \theta_{12}^{(\alpha^*)} = \mathrm{argmin}_{\theta \in \Theta} B(\theta_1 : \theta) = \mathrm{argmin}_{\theta \in \Theta} B(\theta_2 : \theta)$. → Closed-form for order-1 families, or efficient bisection search.
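The bisection search can be sketched concretely. On the Bernoulli family (our choice; the parameter values are made up), $\theta_a = a\theta_1 + (1-a)\theta_2$ traces the e-geodesic, and $g(a) = B(\theta_1 : \theta_a) - B(\theta_2 : \theta_a)$ decreases monotonically from $B(\theta_1 : \theta_2) > 0$ at $a = 0$ to $-B(\theta_2 : \theta_1) < 0$ at $a = 1$, so plain bisection finds the root, i.e. the point where the two Bregman divergences balance:

```python
import math

# Bernoulli family: F(theta) = log(1 + e^theta), eta = F'(theta) = sigmoid.
F = lambda t: math.log1p(math.exp(t))
eta = lambda t: 1.0 / (1.0 + math.exp(-t))

def bregman(tp, tq):
    """B(tp : tq) = F(tp) - F(tq) - (tp - tq) F'(tq)."""
    return F(tp) - F(tq) - (tp - tq) * eta(tq)

def alpha_star(theta1, theta2, iters=60):
    """Bisect on a for the root of g(a) = B(theta1:theta_a) - B(theta2:theta_a)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a = 0.5 * (lo + hi)
        ta = a * theta1 + (1 - a) * theta2
        g = bregman(theta1, ta) - bregman(theta2, ta)
        lo, hi = (a, hi) if g > 0 else (lo, a)
    return 0.5 * (lo + hi)

t1, t2 = -1.0, 2.0
a = alpha_star(t1, t2)
ta = a * t1 + (1 - a) * t2
C = bregman(t1, ta)  # Chernoff information C = B(theta1:theta*) = B(theta2:theta*)
print(abs(bregman(t1, ta) - bregman(t2, ta)) < 1e-9)  # True: divergences balance
print(C > 0)  # True
```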
• 13. Geometry of the best error exponent: binary hypothesis. (Figure, in the $\eta$-coordinate system: the Chernoff distribution $p_{\theta_{12}^*}$ lies at the intersection of the e-geodesic $G_e(P_{\theta_1}, P_{\theta_2})$ and the m-bisector $\mathrm{Bi}_m(P_{\theta_1}, P_{\theta_2})$; $C(\theta_1 : \theta_2) = B(\theta_1 : \theta_{12}^*)$.)
• 14. Geometry of the best error exponent: multiple hypothesis. $n$-ary MHT [8] from the minimum pairwise Chernoff distance: $C(P_1, \ldots, P_n) = \min_{i, j \neq i} C(P_i, P_j)$, so that $P_e^{(m)} \leq e^{-m C(P_{i^*}, P_{j^*})}$ with $(i^*, j^*) = \mathrm{argmin}_{i, j \neq i} C(P_i, P_j)$. Compute, for each pair of natural neighbors [3] $P_{\theta_i}$ and $P_{\theta_j}$, the Chernoff distance $C(P_{\theta_i}, P_{\theta_j})$, and choose the pair with minimal distance (proof by contradiction using the Bregman Pythagorean theorem). → A closest Bregman pair problem (the Chernoff distance fails the triangle inequality).
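Ignoring the natural-neighbor pruning of [3], the closest-pair computation itself is a straightforward minimum over pairs. An illustrative brute-force sketch on three made-up discrete hypotheses (the grid-search Chernoff estimate is ours):

```python
import math
from itertools import combinations

def chernoff(p, q, steps=2000):
    """Grid-search estimate of C(P,Q) = -log min_alpha sum_x p^alpha q^(1-alpha)."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return -math.log(min(sum(a ** al * b ** (1 - al) for a, b in zip(p, q))
                         for al in grid))

hyps = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
# Minimum pairwise Chernoff distance: the dominating pair (i*, j*).
(i, j), C = min((((i, j), chernoff(hyps[i], hyps[j]))
                 for i, j in combinations(range(len(hyps)), 2)),
                key=lambda t: t[1])
print((i, j), C > 0)  # the pair governing P_e^(m) <= e^{-m C}
```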
• 15. Hypothesis testing: illustration. (Figure, in the $\eta$-coordinate system: Chernoff distributions between natural neighbours.)
• 16. Summary. Bayesian multiple hypothesis testing from the viewpoint of computational geometry: the probability of error and the best MAP Bayesian rule; total variation and $P_e$, upper-bounded by the Chernoff distance. On exponential family manifolds: the MAP rule is a nearest neighbor classifier (additive Bregman Voronoi diagram); the best error exponent comes from the geodesic/bisector intersection for binary hypotheses, and from the closest Bregman pair for multiple hypotheses.
• 17. Thank you. 28th–30th August, Paris.

```bibtex
@incollection{HTIGCG-GSI-2013,
  year      = {2013},
  booktitle = {Geometric Science of Information},
  volume    = {8085},
  series    = {Lecture Notes in Computer Science},
  editor    = {Frank Nielsen and Fr\'ed\'eric Barbaresco},
  title     = {Hypothesis testing, information divergence and computational geometry},
  publisher = {Springer Berlin Heidelberg},
  author    = {Nielsen, Frank},
  pages     = {241--248}
}
```
• 18. Bibliographic references.
  [1] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
  [2] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, 2010.
  [3] Jean-Daniel Boissonnat and Mariette Yvinec. Algorithmic Geometry. Cambridge University Press, 1998.
  [4] Lawrence D. Brown. Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Institute of Mathematical Statistics, 1986.
  [5] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.
  [6] Vincent Garcia, Eric Debreuve, Frank Nielsen, and Michel Barlaud. k-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In IEEE International Conference on Image Processing (ICIP), pages 3757–3760, 2010.
  [7] Martin E. Hellman and Josef Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Transactions on Information Theory, 16:368–372, 1970.
  [8] C. C. Leang and D. H. Johnson. On the asymptotics of M-hypothesis Bayesian detection. IEEE Transactions on Information Theory, 43(1):280–282, January 1997.
  [9] Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Submitted, 2012.
  [10] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. Preliminary technical report on arXiv.
  [11] Frank Nielsen. Hypothesis testing, information divergence and computational geometry. In Frank Nielsen and Frédéric Barbaresco, editors, Geometric Science of Information, volume 8085 of Lecture Notes in Computer Science, pages 241–248. Springer Berlin Heidelberg, 2013.
  [12] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3):269–272, March 2013.
  [13] Frank Nielsen. Pattern learning and recognition on statistical manifolds: An information-geometric review. In Edwin Hancock and Marcello Pelillo, editors, Similarity-Based Pattern Recognition, volume 7953 of Lecture Notes in Computer Science, pages 1–25. Springer Berlin Heidelberg, 2013.
  [14] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
  [15] Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In IEEE International Conference on Multimedia and Expo (ICME), pages 878–881, 2009.
  [16] Paolo Piro, Frank Nielsen, and Michel Barlaud. Tailored Bregman ball trees for effective nearest neighbors. In European Workshop on Computational Geometry (EuroCG), LORIA, Nancy, France, March 2009.