State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle

Clustering is an automatic learning technique aimed at grouping a set of objects into subsets or clusters. The goal is to create clusters that are coherent internally, but substantially different from each other. In plain words, objects in the same cluster should be as similar as possible, whereas objects in one cluster should be as dissimilar as possible from objects in the other clusters.

Clustering is an unsupervised learning technique, because it groups objects into clusters without any additional information: only the information provided by the data is used, and no human intervention adds information to guide the learning.

The application domains are manifold. One example is the grouping of text documents: in this case the goal is the construction of groups of documents related to each other, i.e. documents on the same topic.

The goal of this thesis is an in-depth study of state-of-the-art and experimental clustering techniques. We consider two techniques. The first is known as the Minimum Bregman Information principle. This principle generalizes the classic relocation scheme already adopted by K-means, allowing the use of a rich family of divergence functions called Bregman divergences. A new, more general clustering scheme was developed on top of this principle. Moreover, a co-clustering scheme is formulated as well, leading to an important generalization, as we will see in the sequel.

The second approach is Support Vector Clustering. It is a clustering process built on the state of the art in learning machines: the Support Vector Machines. Support Vector Clustering is currently the subject of active research, as it is still at an early stage of development. We have carefully analyzed this clustering method and provided contributions that reduce the number of iterations and the computational complexity, and improve accuracy.

The main application domains we have dealt with are text mining and astrophysics data mining. Within these application domains we have verified and carefully analyzed the properties of both methodologies by means of dedicated experiments.

The results are given in terms of robustness w.r.t. missing values, behavior under dimensionality reduction, robustness w.r.t. noise and outliers, and the ability to describe clusters of arbitrary shape.


State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle

  1. UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II
     State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle
     by VINCENZO RUSSO
     Supervisor: prof. Anna CORAZZA
     Co-supervisor: prof. Ezio CATANZARITI
  2. Introduction: What is clustering?
     - Non-structured data.
     - Unsupervised learning: groups a set of objects into subsets called clusters.
     - The objects are represented as points in a subspace of R^d, where d is the number of point components, also called attributes or features. [Figure: a 3-cluster structure]
     - Several application domains: information retrieval, bioinformatics, cheminformatics, image retrieval, astrophysics, market segmentation, etc.
  3. Goals
     Two state-of-the-art approaches: Support Vector Clustering (SVC) and Bregman Co-clustering.
     Goals and application domains:
     - Robustness w.r.t. missing-valued data    -> Astrophysics
     - Robustness w.r.t. sparse data            -> Textual documents
     - Robustness w.r.t. high "dimensionality"  -> Textual documents
     - Robustness w.r.t. noise/outliers         -> Synthetic data
     Other desirable properties: handling of nonlinearly separable problems, automatic detection of the number of clusters, application domain independence.
  4. Support Vector Clustering: the idea
     - Support Vector Domain Description (SVDD): an SVM formulation for one-class classification (Tax, 2001; Tax and Duin, 1999a,b, 2004). The SVDD is the basic step of the SVC and allows describing the boundaries of clusters.
     - Let X = {x_1, x_2, ..., x_n} be a dataset of n points, with X ⊆ R^d the data space. A nonlinear transformation φ : X -> F maps the input space X to some high-dimensional feature space F, wherein we look for the smallest enclosing sphere (the Minimum Enclosing Ball, MEB) of radius R containing the feature-space images of all the points. The MEB was originally used to estimate the Vapnik-Chervonenkis (VC) dimension (Vapnik, 1995), later to estimate the support of a high-dimensional distribution (Schölkopf et al.), and finally for the Support Vector Domain Description.
     - Mapped back to the data space (φ^-1 : F -> X), the sphere describes the cluster boundaries; a second step, called cluster labeling, determines the membership of points and performs the actual cluster assignment.
     Notes: the name "cluster labeling" probably descends from the originally proposed algorithm, which is based on finding the connected components of a graph (such algorithms usually assign the "component labels" to the vertices). An alternative SVM formulation for the same task, the One-Class SVM, can be found in Schölkopf et al. (2000b).
  5. Support Vector Clustering - Phase I: Cluster description
     - Finding the Minimum Enclosing Ball (MEB) via the Support Vector Domain Description (SVDD):

           min_{R, a, ξ}  R^2 + C Σ_k ξ_k
           subject to     ||φ(x_k) - a||^2 ≤ R^2 + ξ_k,   ξ_k ≥ 0,   k = 1, 2, ..., n

       where a is the center of the sphere and the slack variables ξ_k make the constraints soft. The real constant C (soft margin) provides a way to control outliers; q is the kernel width.
     - The kernel function defines an implicit mapping φ: through a Mercer kernel K we can perform the inner product in the feature space F without knowing φ explicitly. Using nonlinear kernel transformations, a nonlinearly separable problem in the data space X can become linearly separable in the feature space F. With the Gaussian kernel K(x, y) = e^{-q ||x - y||^2}, the parameter q is called the kernel width; its mathematical meaning depends on the kernel (for the Gaussian kernel it is related to the inverse of the variance, q = 1/(2σ^2)), while in polynomial kernels the parameter is the degree.
     - Squared feature-space distance of a point x from the center a of the sphere, in kernelized form:

           d_R^2(x) = ||φ(x) - a||^2 = K(x, x) - 2 Σ_k β_k K(x_k, x) + Σ_{k,l} β_k β_l K(x_k, x_l)

       The vector β of Lagrange multipliers is sparse: only the entries associated with the support vectors are non-zero.
     - The SVDD is a QP problem with O(n^3) worst-case running time; it can be solved with Sequential Minimal Optimization. A small numeric sketch of the distance computation follows below.
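     As a concrete illustration of the kernelized distance above, here is a minimal Python sketch, assuming the multipliers beta have already been obtained by solving the SVDD dual (e.g. with an SMO-style solver); the function names are illustrative, not the thesis code.

         import numpy as np

         def gaussian_kernel(x, y, q):
             # K(x, y) = exp(-q * ||x - y||^2); q is the kernel width
             return np.exp(-q * np.sum((x - y) ** 2))

         def squared_feature_distance(x, data, beta, q):
             # d_R^2(x) = K(x, x) - 2 * sum_k beta_k K(x_k, x)
             #            + sum_{k,l} beta_k beta_l K(x_k, x_l)
             k_xx = gaussian_kernel(x, x, q)  # equals 1 for any normalized kernel
             k_xk = np.array([gaussian_kernel(xk, x, q) for xk in data])
             gram = np.array([[gaussian_kernel(xi, xj, q) for xj in data] for xi in data])
             return k_xx - 2.0 * beta @ k_xk + beta @ gram @ beta

         def sphere_radius(data, beta, q, sv_index):
             # R equals the feature-space distance of any unbounded support vector
             return np.sqrt(squared_feature_distance(data[sv_index], data, beta, q))

     A point x lies inside or on the sphere when d_R(x) ≤ R; this is exactly the test that the cluster-labeling phase applies along line segments.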
  6. Support Vector Clustering - Phase II: Cluster labeling
     - Phase I only describes the clusters' boundaries.
     - Given a pair of data points that belong to different clusters, any path that connects them must exit from the sphere in the feature space; such a path therefore contains a segment of points y with d_R(y) > R. This leads to the definition of an adjacency matrix A between all pairs of points whose images lie in or on the sphere in feature space:

           A_ij = 1  if d_R(y) ≤ R for all y ∈ S_ij,    A_ij = 0  otherwise,

       where S_ij is the line segment connecting x_i and x_j. Checking the segment is implemented by sampling a number m of points between the starting point and the ending point; the exactness of the check depends on m.
     - The clusters are the connected components of the graph induced by the matrix A; each component is a cluster.
     - The BSVs (bounded support vectors) lie outside the enclosing sphere and are left unclassified by this procedure; one may either leave them unclassified or assign each to the closest cluster, which is generally the most appropriate choice.
     - The original Phase II is a bottleneck (worst case). Alternatives: Cone Cluster Labeling (best performance/accuracy rate), Gradient Descent. A sketch of the original labeling step follows below.
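     A minimal sketch of the cluster-labeling phase described above, assuming a d_R(·) function and radius R such as those from the previous sketch: it samples m points on each segment, builds the adjacency matrix A, and extracts the connected components (here with scipy; a plain BFS would do as well). The exhaustive pairwise check is the bottleneck the slide refers to.

         import numpy as np
         from scipy.sparse.csgraph import connected_components

         def cluster_labeling(data, d_R, R, m=10):
             # Phase II: A[i, j] = 1 iff every sampled point y on the segment x_i -> x_j
             # satisfies d_R(y) <= R; the clusters are the connected components of A.
             n = len(data)
             A = np.eye(n, dtype=int)
             ts = np.linspace(0.0, 1.0, m + 2)[1:-1]  # m interior sample points per segment
             for i in range(n):
                 for j in range(i + 1, n):
                     inside = all(d_R(data[i] + t * (data[j] - data[i])) <= R for t in ts)
                     A[i, j] = A[j, i] = int(inside)
             _, labels = connected_components(A, directed=False)
             return labels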
  7. Support Vector Clustering - Pseudo-hierarchical execution
     Parameters exploration:
     - The greater the kernel width q, the greater the number of support vectors (and so of clusters).
     - C rules the number of outliers and allows dealing with strongly overlapping clusters.
     - A brute-force exploration is unfeasible. Approaches proposed in the literature: a secant-like algorithm for the q exploration; no theoretically rooted method for the C exploration.
     - Data analysis is performed at different levels of detail. The execution is pseudo-hierarchical: a strict hierarchy is not guaranteed when C < 1, due to the Bounded Support Vectors.
     A sketch of this exploration loop follows below.
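     The following is a hedged sketch of the outer exploration loop, not the thesis code: the two SVC phases are passed in as callables, and the secant-like q update used in the thesis is replaced by a plain geometric growth for brevity.

         def svc_exploration(data, describe, label, q_start=0.1, q_max=100.0, growth=1.5):
             # Pseudo-hierarchical SVC run: the kernel width q grows step by step and
             # one clustering is collected per value, giving increasing levels of detail.
             results = []
             q = q_start
             while q <= q_max:
                 beta, R = describe(data, q)        # Phase I: SVDD / minimum enclosing ball
                 labels = label(data, beta, R, q)   # Phase II: cluster labeling
                 results.append((q, labels))
                 q *= growth                        # stand-in for the secant-like update
             return results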
  8. Support Vector Clustering - Proposed improvements
     - Soft margin C parameter selection. Heuristics: successfully applied in 90% of the cases; only 10 tests out of 100 needed further tuning, and those 10 datasets had a high percentage of missing values.
     - New robust stop criterion, based upon relative evaluation criteria (C-index, Dunn index, ad hoc indices).
     - Kernel width (q) selection: integration in the SVC loop, with complexity reduced from O(Qn^3) to O(n_sv^2); softening strategy heuristics; applicable to all normalized kernels.
     - More kernels: Exponential (K(x, y) = e^{-q ||x - y||}), Laplace (K(x, y) = e^{-q |x - y|}).
  9. Support Vector Clustering - Improvements: stop criterion
     Dataset   Detected clusters   Actual clusters   Validity index
     Iris      1                   3                 1.00E-06
     Iris      3                   3                 0.13
     Iris      4                   3                 0.05
     Breast    1                   2                 1.00E-05
     Breast    2                   2                 0.80
     Breast    4                   2                 0.27
     The bigger the validity index, the better the clustering found. The stop criterion halts the process when the index value starts to decrease. The idea: the SVC outputs quality-increasing clusterings before reaching the optimal clustering; after that, it provides quality-decreasing partitionings. A sketch of this criterion follows below.
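     A minimal sketch of the stop criterion just described, assuming a validity_index function (C-index, Dunn index, or an ad hoc measure): the exploration halts as soon as the index stops increasing, and the best clustering seen so far is returned.

         def run_until_quality_drops(clusterings, validity_index):
             # `clusterings` yields (q, labels) pairs in order of increasing detail;
             # stop when the relative validity index starts to decrease.
             best, best_score = None, float("-inf")
             for q, labels in clusterings:
                 score = validity_index(labels)
                 if score < best_score:
                     break             # quality started decreasing: halt here
                 best, best_score = (q, labels), score
             return best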
  10. Support Vector Clustering - Improvements: kernel width selection
      Dataset     Algorithm     Accuracy            Macroaveraging           # iter   # potential "q"
      Iris        SVC           88.00%              87.69%                   2        9
      Iris        + softening   94.00%              93.99%                   1        13
      Iris        K-means       85.33%              85.11%                   not applicable
      Wine        SVC           87.07%              87.55%                   3        7
      Wine        + softening   93.26%              93.91%                   2        6
      Wine        K-means       50.00%              51.78%                   not applicable
      Syn02       SVC           88.80%              100.00%                  8        18
      Syn02       + softening   88.00%              100.00%                  4        15
      Syn02       K-means       68.40%              63.84%                   not applicable
      Syn03       SVC           87.30%              100.00%                  17       36
      Syn03       + softening   87.30%              100.00%                  6        31
      Syn03       K-means       39.47%              39.90%                   not applicable
      B. Cancer   SVC           91.85% (benign)     11.00% (contamination)   3        11
      B. Cancer   + softening   96.71% (benign)     2.82% (contamination)    3        13
      B. Cancer   K-means       60.23% (benign)     32.00% (contamination)   not applicable
  11. Support Vector Clustering - Improvements: non-Gaussian kernels
      Exponential kernel: improves the cluster separation in several cases.
      Dataset   Algorithm         Accuracy   Macroaveraging   # iter   # potential "q"
      Iris      SVC + softening   94.00%     93.99%           1        13
      Iris      + Exp kernel      97.33%     97.33%           1        15
      Iris      K-means           85.33%     85.11%           not applicable
      CLA3      SVC + softening   Failed: only one class out of 3 separated
      CLA3      + Exp kernel      94.00%     93.99%           1        11
      CLA3      K-means           85.33%     85.11%           not applicable
      Laplace kernel: improves/allows the cluster separation with normalized data.
      Dataset   Algorithm          Accuracy   # iter   # potential "q"
      SG03      SVC + softening    Failed: no class separated
      SG03      + Laplace kernel   99.94%     1        17
      SG03      K-means            83.00%     not applicable
      Quad      SVC + softening    73.15%     3        19
      Quad      + Laplace kernel   91.04%     1        16
      Quad      K-means            50.24%     not applicable
      (The kernel definitions are sketched below.)
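      For reference, a sketch of the three kernels mentioned in the slides, in the notation used above; q is the kernel width, the exponential kernel uses the non-squared Euclidean norm, and the Laplace kernel is read here with an L1 norm, which is one possible interpretation of |x - y|.

          import numpy as np

          def gaussian_kernel(x, y, q):
              return np.exp(-q * np.sum((x - y) ** 2))    # K(x, y) = e^(-q ||x - y||^2)

          def exponential_kernel(x, y, q):
              return np.exp(-q * np.linalg.norm(x - y))   # K(x, y) = e^(-q ||x - y||)

          def laplace_kernel(x, y, q):
              return np.exp(-q * np.sum(np.abs(x - y)))   # K(x, y) = e^(-q |x - y|), L1 reading

      All three satisfy K(x, x) = 1, i.e. they are normalized kernels, which is what makes the softening strategy for the kernel width selection applicable to them.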
  12. Minimum Bregman Information Principle
      Bregman Co-clustering (BCC):
      - Co-clustering: simultaneous clustering of both rows and columns of a data matrix.
      - Generalizes K-means to a large class of divergences, the Bregman divergences (Bregman, 1967), which enjoy a number of desirable properties.
      - Bregman divergence: for a strictly convex, differentiable function φ,

            d_φ(x_1, x_2) = φ(x_1) - φ(x_2) - <x_1 - x_2, ∇φ(x_2)>.

        Example (squared Euclidean distance): φ(x) = ||x||^2 gives d_φ(x_1, x_2) = ||x_1 - x_2||^2, the simplest and most widely used Bregman divergence.
      - Bregman Information (BI) of a random variable X taking values in X = {x_i} with a positive probability measure ν: I_φ(X) = E_ν[d_φ(X, μ)] = Σ_i ν_i d_φ(x_i, μ), where μ = E_ν[X]. For the squared Euclidean distance with ν_i = 1/n, the Bregman Information is the variance.
      - Minimum Bregman Information (MBI) principle: the problem min_{s ∈ ri(S)} E_ν[d_φ(X, s)] has a unique solution s* = μ = E_ν[X]. A clustering meta-algorithm is built on top of this principle.
      Divergence                     Bregman Information    Algorithm
      Squared Euclidean distance     Variance               K-means (Least Squares)
      Relative Entropy               Mutual Information     Maximum Entropy
      Itakura-Saito                  (unnamed)              Linde-Buzo-Gray
      (A sketch of Bregman hard clustering follows below.)
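      To make the MBI/relocation idea concrete, here is a hedged sketch of Bregman hard clustering, the K-means generalization the slide refers to: any Bregman divergence d_φ can be plugged in, and the MBI principle guarantees that the representative minimizing each cluster's Bregman Information is always the arithmetic mean. The function names are illustrative.

          import numpy as np

          def squared_euclidean(x, c):
              return np.sum((x - c) ** 2)                  # d_phi for phi(x) = ||x||^2

          def itakura_saito(x, c):
              return np.sum(x / c - np.log(x / c) - 1.0)   # d_phi for phi(x) = -sum(log x_i); x, c > 0

          def bregman_hard_clustering(X, k, d_phi, n_iter=100, seed=0):
              # Generalized K-means: assign each point to the representative with the
              # smallest Bregman divergence, then move every representative to the mean
              # of its cluster (the unique MBI solution, whatever divergence is used).
              rng = np.random.default_rng(seed)
              centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
              labels = np.zeros(len(X), dtype=int)
              for _ in range(n_iter):
                  labels = np.array([np.argmin([d_phi(x, c) for c in centers]) for x in X])
                  new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                          else centers[j] for j in range(k)])
                  if np.allclose(new_centers, centers):
                      break
                  centers = new_centers
              return labels, centers

      With squared_euclidean this reduces exactly to K-means; with itakura_saito on strictly positive data it corresponds to the Linde-Buzo-Gray row of the table above.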
  13. Other experiments - Sparse data and missing-valued data
      Star/Galaxy data with missing values:
      Dataset            SVC      BCC      K-means   # attr. affected   % obj. affected
      MV5000 (25D)       99.02%   94.00%   71.08%    10                 27.0%
      MV10000 (25D)      96.10%   95.60%   75.12%    10                 29.0%
      AMV5000 (15D)      91.76%   79.46%   74.90%    6                  30.0%
      AMV10000 (15D)     90.31%   83.51%   68.20%    6                  30.0%
      Textual document data: sparsity and high "dimensionality":
      Dataset            SVC      BCC       K-means
      CLASSIC3 (3303D)   99.80%   100.00%   49.80%
      SCI3 (9456D)       failed   89.39%    39.15%
      PORE (13821D)      failed   82.68%    45.91%
  14. Other experiments - Outliers
      Dataset      SVC       Best BCC   K-means   # objects   # outliers
      SynDECA 02   100.00%   94.18%     68.04%    1,000       112
      SynDECA 03   100.00%   49.00%     39.47%    10,000      1,270
      [Figures: scatter plots of the SynDECA 02 and SynDECA 03 datasets]
  15. Conclusions and future works - Conclusions
      Support Vector Clustering achieves the goals:
      - Robustness w.r.t. missing-valued data    -> Astrophysics
      - Robustness w.r.t. sparse data            -> Textual documents
      - Robustness w.r.t. high "dimensionality"  -> Textual documents
      - Robustness w.r.t. noise/outliers         -> Synthetic data
      Other properties, verified over the whole experimental stage:
      - Automatic discovery of the number of clusters
      - Application domain independence
      - Handling of nonlinearly separable problems
      - Handling of arbitrarily shaped clusters
      Bregman Co-clustering achieves the same goals, but the following problems still hold: estimating the number of clusters, and handling outliers.
  16. Conclusions and future works - Contribution
      SVC was made applicable in practice:
      - Complexity reduction for the kernel width selection
      - Soft margin C parameter estimation
      - New effective stop criterion
      - Non-Gaussian kernels
      The kernel width selection was shown to be applicable to all normalized kernels; the Exponential and Laplace kernels were successfully used.
      Improved accuracy: softening strategy for the kernel width selection.
  17. Conclusion and future works - Future works
      - Minimum Enclosing Bregman Ball (MEBB): generalization of the Minimum Enclosing Ball (MEB) problem and of the Bâdoiu-Clarkson (BC) algorithm to Bregman divergences (a sketch of the classic BC update is given after this slide).
        [Figure 10.1: examples of Bregman balls for the Itakura-Saito distance, the squared Euclidean (L2^2) distance and the Kullback-Leibler divergence (Nock and Nielsen, 2005, fig. 2). Since a Bregman divergence is usually not symmetric, each divergence actually defines two dual Bregman balls.]
      - Core Vector Machines (CVM): the CVMs reformulate the SVMs as a MEB problem and make use of the BC algorithm, which is also exploited in the cluster description stage of the SVC.
      - MEBB + CVM = Bregman Vector Machines: new implications for vector machines and for the SVC (e.g. adapting the cluster labeling algorithms and the cluster description stage to Bregman divergences). We definitely intend to explore this way.
      - Improve and extend the SVC software: for the sake of accuracy, and in order to perform more robust comparisons with other clustering algorithms, an improved and extended software for Support Vector Clustering is needed, implementing all the key contributions proposed around the world; the tests have currently been performed by exploiting only some of the Bregman divergences.
      The End.
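      Since the future work revolves around the Bâdoiu-Clarkson algorithm, here is a minimal sketch of its classic Euclidean version; the MEBB generalization of Nock and Nielsen replaces the Euclidean distance with a Bregman divergence, so this is only the standard MEB approximation, not the Bregman variant.

          import numpy as np

          def badoiu_clarkson_meb(X, n_iter=100):
              # Classic Bâdoiu-Clarkson approximation of the Minimum Enclosing Ball:
              # repeatedly pull the center towards the farthest point with step 1/(t+1).
              c = X[0].astype(float)
              for t in range(1, n_iter + 1):
                  far = X[np.argmax(np.linalg.norm(X - c, axis=1))]
                  c = c + (far - c) / (t + 1)
              radius = np.max(np.linalg.norm(X - c, axis=1))
              return c, radius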