Overlapping correlation clustering

Overlapping clustering, where a data point can be assigned to more than one cluster, is desirable in various applications, such as bioinformatics, information retrieval, and social network analysis. In this paper we generalize the framework of correlation clustering to deal with overlapping clusters. In short, we formulate an optimization problem in which each point in the dataset is mapped to a small set of labels, representing membership in different clusters. The number of labels does not have to be the same for all data points. The objective is to find a mapping so that the distances between points in the dataset agree as much as possible with distances taken over their sets of labels. For defining distances between sets of labels, we consider two measures: set-intersection indicator and the Jaccard coefficient.

To solve the problem we propose a local-search algorithm. Iterative improvement within our algorithm gives rise to non-trivial optimization problems, which, for the measures of set intersection and Jaccard, we solve using a greedy method and non-negative least squares, respectively.


Overlapping correlation clustering

  1. overlapping correlation clustering. francesco bonchi, aris gionis, antti ukkonen (yahoo! research barcelona). Monday, September 26, 2011
  2. overlapping clusters are very natural: social networks, proteins, documents
  3. most clustering algorithms produce disjoint partitions
  4. overlapping is conceptually challenging to formulate: why assign a point to a farther center? why/how to generate less-good clusters?
  5. correlation clustering: C_cc(ℓ) = Σ_{(u,v)} |s(u,v) − I(ℓ(u) = ℓ(v))|
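The cost on this slide can be evaluated directly; a minimal sketch, with function and variable names of my own choosing, for similarities s(u,v) in [0,1] and a single cluster label per point:

```python
def cc_cost(sim, labels):
    """Correlation-clustering cost: sum over pairs of
    |s(u,v) - I(l(u) == l(v))|.
    sim: dict mapping (u, v) pairs to a similarity in [0, 1];
    labels: dict mapping each point to one cluster label."""
    return sum(abs(s - (1.0 if labels[u] == labels[v] else 0.0))
               for (u, v), s in sim.items())

sim = {("a", "b"): 1.0, ("a", "c"): 0.0, ("b", "c"): 0.2}
labels = {"a": 1, "b": 1, "c": 2}
print(cc_cost(sim, labels))  # 0.2: only the (b, c) pair disagrees
```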
  6.–14. [example slides: pairwise similarity values 1, 0, 0.33, 0.5, 0.67, and 54167/108301 cannot all be matched by a disjoint partition; slide 11 notes that multiple labels = multi-cluster assignment]
  15. overlapping correlation clustering: C_occ(ℓ) = Σ_{(u,v)} |s(u,v) − H(ℓ(u), ℓ(v))|
  16. comparing sets of labels H(ℓ(u), ℓ(v)): the Jaccard coefficient, or the set-intersection indicator
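The two label-set measures and the overlapping cost C_occ can be sketched as follows (a minimal illustration; function names are my own, not from the paper):

```python
def jaccard(A, B):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; taken as 0 for two empty sets."""
    union = A | B
    return len(A & B) / len(union) if union else 0.0

def isect(A, B):
    """Set-intersection indicator: 1 iff the two label sets share a label."""
    return 1.0 if A & B else 0.0

def occ_cost(sim, labels, H):
    """C_occ(l) = sum over pairs of |s(u,v) - H(l(u), l(v))|,
    where each labels[v] is now a *set* of cluster labels."""
    return sum(abs(s - H(labels[u], labels[v])) for (u, v), s in sim.items())

sim = {("u", "v"): 0.5, ("u", "w"): 0.0}
labels = {"u": {1, 2}, "v": {2, 3}, "w": {4}}
print(occ_cost(sim, labels, jaccard))  # |0.5 - 1/3| + 0 ≈ 0.167
print(occ_cost(sim, labels, isect))    # |0.5 - 1| + 0 = 0.5
```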
  17.–21. overlapping correlation clustering vs. correlation clustering: both use a set of labels L with |L| = k; correlation clustering assigns ℓ(u) ∈ L, while overlapping correlation clustering assigns ℓ(u) ⊆ L; the cost is C(ℓ) = Σ |s(u,v) − H(ℓ(u), ℓ(v))|, under the constraint |ℓ(u)| ≤ p
  22. dimensionality reduction: a mapping to sets instead of vectors
  23.–28. [image-only slides]
  29.–30. [figure slides showing vertices u, v, x, y and the pairs (u,v), (x,y)]
  31. C_{v,p}(L | ℓ) expresses the error incurred by vertex v when it has the labels L and the remaining nodes are labeled according to ℓ. The subscript p in C_{v,p} serves to remind us that the set ℓ(v) should have at most p labels. Our general local-search strategy is summarized in Algorithm 1.

      Algorithm 1 LocalSearch
      1: initialize ℓ to a valid labeling;
      2: while C_occ(V, ℓ) decreases do
      3:   for each v ∈ V do
      4:     find the label set L that minimizes C_{v,p}(L | ℓ);
      5:     update ℓ so that ℓ(v) = L;
      6: return ℓ

      Line 4 is the step in which LocalSearch seeks to find an optimal set of labels for an object v by solving Equation (3). This is also the place where our framework differentiates between the measures of Jaccard coefficient and set-intersection.
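A runnable sketch of the LocalSearch loop above, assuming (as a simplification of my own, not the paper's method) that the local step of line 4 is solved by brute force over all label subsets of size at most p; the paper instead uses specialized solvers for the Jaccard and set-intersection cases:

```python
from itertools import combinations

def jaccard(A, B):
    union = A | B
    return len(A & B) / len(union) if union else 0.0

def local_search(points, L, p, sim, H=jaccard, max_rounds=100):
    """Algorithm 1, with line 4 solved exhaustively (fine for tiny k and p)."""
    labels = {v: {min(L)} for v in points}  # a trivial valid initial labeling

    def vertex_cost(v, S):
        # C_{v,p}(S | labels): error of v with label set S, others fixed
        total = 0.0
        for (a, b), s in sim.items():
            if v in (a, b):
                u = b if a == v else a
                total += abs(s - H(S, labels[u]))
        return total

    def total_cost():
        return sum(abs(s - H(labels[u], labels[v]))
                   for (u, v), s in sim.items())

    best = total_cost()
    for _ in range(max_rounds):
        for v in points:
            candidates = [set(c) for r in range(1, p + 1)
                          for c in combinations(sorted(L), r)]
            labels[v] = min(candidates, key=lambda S: vertex_cost(v, S))
        cost = total_cost()
        if cost >= best:  # no improvement in this sweep: stop
            break
        best = cost
    return labels, total_cost()

sim = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 0.0}
labels, cost = local_search(["a", "b", "c"], {1, 2}, p=2, sim=sim)
print(cost)  # 1.0: the three constraints cannot all be satisfied with k = 2
```

Each vertex update can only decrease (or keep) the cost, since the vertex's current label set is among the candidates, so the loop terminates.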
  32. Jaccard triangulation: given {(S_j, z_j)}_{j=1..n}, find X ⊆ U to minimize d(X, {(S_j, z_j)}_{j=1..n}) = Σ_{j=1}^{n} |J(X, S_j) − z_j|
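The Jaccard-triangulation problem as stated above, solved here by brute force over all subsets of the ground set U (exponential, for illustration only); the paper instead relaxes it to a non-negative least-squares problem:

```python
from itertools import combinations

def jaccard(X, S):
    union = X | S
    return len(X & S) / len(union) if union else 0.0

def jaccard_triangulation(U, pairs):
    """pairs: list of (S_j, z_j) with S_j ⊆ U and target z_j in [0, 1];
    returns an X ⊆ U minimizing sum_j |J(X, S_j) - z_j| by exhaustive search."""
    subsets = [set(c) for r in range(len(U) + 1)
               for c in combinations(sorted(U), r)]
    return min(subsets,
               key=lambda X: sum(abs(jaccard(X, S) - z) for S, z in pairs))

U = {1, 2, 3}
pairs = [({1, 2}, 1.0), ({3}, 0.0)]
X = jaccard_triangulation(U, pairs)
print(X)  # {1, 2}: matches both targets exactly
```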
  33. set-intersection indicator: hit-n-miss sets; an O(√(n log n)) approximation; greedy approach
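A hedged sketch of a greedy local step for the set-intersection measure; the details below are my own simplification of the slide's "greedy approach", not the paper's algorithm: start from the single best label for vertex v and keep adding whichever label most reduces v's error against its neighbors.

```python
def greedy_isect_step(v, L, p, sim, labels):
    """Pick a label set for v with at most p labels, under the intersection
    indicator H(A, B) = 1 iff A and B share a label; labels holds the
    (fixed) label sets of the other vertices."""
    def cost(S):
        total = 0.0
        for (a, b), s in sim.items():
            if v in (a, b):
                u = b if a == v else a
                total += abs(s - (1.0 if S & labels[u] else 0.0))
        return total

    best = min(({l} for l in sorted(L)), key=cost)  # best single label
    while len(best) < p:
        extensions = [best | {l} for l in sorted(L) if l not in best]
        if not extensions:
            break
        candidate = min(extensions, key=cost)
        if cost(candidate) >= cost(best):
            break  # no extension helps: stop growing the set
        best = candidate
    return best

sim = {("a", "b"): 1.0, ("a", "c"): 1.0}
labels = {"b": {1}, "c": {2}}
print(greedy_isect_step("a", {1, 2}, 2, sim, labels))  # {1, 2}
```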
  34. experimental evaluation
  35. EMOTION: 593 objects, 6 labels; YEAST: 2417 objects, 14 labels
  36. [Fig. 1: cost per edge, precision, and recall of OCC-JACC as a function of k, on EMOTION and YEAST]
  37. [Fig. 3: pruning experiment using OCC-JACC; cost/edge, precision, and recall as a function of the pruning threshold q. The results for YEAST shown in Figures 1 and 2 were computed with q = 0.05.]
  38. protein clustering: pairwise similarities based on matching of amino-acid sequences; compared using a hand-made taxonomy
  39. Table II: precision, recall, and their harmonic mean (F-score) for non-overlapping clusterings of protein sequence datasets, computed using SCPS [14] and the OCC algorithms. BL is the precision of a baseline that assigns all sequences to the same cluster.

               BL      SCPS                   OCC-ISECT              OCC-JACC
      dataset  prec    prec/recall/F-score    prec/recall/F-score    prec/recall/F-score
      D1       0.21    0.56 / 0.82 / 0.664    0.70 / 0.67 / 0.683    0.57 / 0.55 / 0.561
      D2       0.17    0.59 / 0.89 / 0.708    0.86 / 0.83 / 0.844    0.64 / 0.63 / 0.637
      D3       0.38    0.93 / 0.88 / 0.904    0.81 / 0.43 / 0.558    0.73 / 0.39 / 0.505
      D4       0.14    0.30 / 0.64 / 0.408    0.64 / 0.56 / 0.598    0.44 / 0.39 / 0.412

      Summarizing: C3 and C4 contain elks and deer that stay away from cattle (C3 moving in higher X than C4); C1 also contains only elks and deer, but those move in the higher Y area where the cattle also move; C2 is the cattle cluster and it also contains a few elks and deer; finally, C5 is another mixed cluster which overlaps with C2 only for the cattle.
  40. Table III: comparing overlapping clusterings' cost, based on distance on the SCOP taxonomy, for different values of p, the maximum number of labels per protein.

           SCPS     OCC-ISECT-p1   OCC-ISECT-p2   OCC-ISECT-p3
      D1   0.231    0.196          0.194          0.193
      D2   0.188    0.112          0.107          0.106
      D3   0.215    0.214          0.214          0.231
      D4   0.289    0.139          0.133          0.139

           SCPS     OCC-JACC-p1    OCC-JACC-p2    OCC-JACC-p3
      D1   0.231    0.208          0.202          0.205
      D2   0.188    0.137          0.130          0.127
      D3   0.215    0.243          0.242          0.221
      D4   0.289    0.158          0.141          0.152

      When two proteins are extremely close in the taxonomy, the error should have a small cost. Following this intuition we define the SCOP similarity between two proteins as

          sim(u, v) = d(lca(u, v)) / (max(d(u), d(v)) − 1),   (8)
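The SCOP similarity of Equation (8) can be sketched with a simple parent-pointer taxonomy; the helpers `depth` and `lca` and the toy taxonomy below are my own illustration, where d(x) is a node's depth and lca the lowest common ancestor:

```python
def depth(tax, x):
    """Depth of node x (root has depth 1) in a parent-pointer dict tax."""
    d = 1
    while tax[x] is not None:
        x = tax[x]
        d += 1
    return d

def lca(tax, u, v):
    """Lowest common ancestor of u and v."""
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = tax[u]
    while v not in ancestors:
        v = tax[v]
    return v

def scop_sim(tax, u, v):
    # sim(u, v) = d(lca(u, v)) / (max(d(u), d(v)) - 1), per Equation (8)
    return depth(tax, lca(tax, u, v)) / (max(depth(tax, u), depth(tax, v)) - 1)

# toy taxonomy: root -> fold -> superfamily -> two proteins
tax = {"root": None, "fold": "root", "sf": "fold", "p1": "sf", "p2": "sf"}
print(scop_sim(tax, "p1", "p2"))  # d(sf)=3, both depths 4, so 3/3 = 1.0
```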
  41. future work - scaling up - approximation algorithm - jaccard triangulation - more experimentation and applications
  42. thank you!