To solve the problem we propose a local-search algorithm. Iterative improvement within our algorithm gives rise to non-trivial optimization problems, which, for the measures of set intersection and Jaccard, we solve using a greedy method and non-negative least squares, respectively.
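The two measures named above, set intersection and Jaccard, both compare the label sets assigned to a pair of objects. As a minimal sketch (the function names and the empty-set convention are assumptions for illustration, not taken from the talk), they could be written as:

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two label sets."""
    a, b = set(a), set(b)
    union = a | b
    # Assumed convention: two empty label sets count as identical.
    if not union:
        return 1.0
    return len(a & b) / len(union)


def intersection_indicator(a, b):
    """1.0 if the two label sets share at least one label, else 0.0."""
    return 1.0 if set(a) & set(b) else 0.0
```

For example, `jaccard({1, 2}, {2, 3})` is 1/3, a similarity value that a single label per object cannot express, while `intersection_indicator({1, 2}, {2, 3})` is 1.0.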


- 1. overlapping correlation clustering. Francesco Bonchi, Aris Gionis, Antti Ukkonen, Yahoo! Research Barcelona. Monday, September 26, 2011
- 2. overlapping clusters are very natural: social networks, proteins, documents
- 3. most clustering algorithms produce disjoint partitions
- 4. overlapping is conceptually challenging to formulate: why assign a point to a further center? why/how to generate less good clusters?
- 5. correlation clustering: $C_{cc}(\ell) = \sum_{(u,v) \in V \times V} |s(u,v) - I(\ell(u) = \ell(v))|$
- 7. 1
- 8. 0
- 9. 0.33 ???
- 10. 0.33
- 11. 0.33: multiple labels = multi-cluster assignment
- 12. 0.5
- 13. 0.67
- 14. 54167/108301 ???
- 15. overlapping correlation clustering: $C_{occ}(\ell) = \sum_{(u,v) \in V \times V} |s(u,v) - H(\ell(u), \ell(v))|$
- 16. comparing sets of labels $H(\ell(u), \ell(v))$: Jaccard coefficient, set-intersection indicator
- 17-21. correlation clustering vs. overlapping correlation clustering: set of labels $L$, $|L| = k$; correlation clustering assigns $\ell(u) \in L$, overlapping correlation clustering assigns $\ell(u) \subseteq L$ with $|\ell(u)| \le p$; cost $C(\ell) = \sum_{(u,v) \in V \times V} |s(u,v) - H(\ell(u), \ell(v))|$
- 22. dimensionality reduction: mapping to sets instead of vectors
- 29-30. (figure: vertices u, v, x, y and the pairs (u,v), (x,y))
- 31. … expresses the error incurred by vertex $v$ when it has the labels $\ell(v)$, and the remaining nodes are labeled according to $\ell$. The subscript $p$ in $C_{v,p}$ serves to remind us that the set $\ell(v)$ should have at most $p$ labels. Our general local-search strategy is summarized in Algorithm 1.
  Algorithm 1 LocalSearch
  1: initialize $\ell$ to a valid labeling;
  2: while $C_{occ}(V, \ell)$ decreases do
  3:   for each $v \in V$ do
  4:     find the label set $L$ that minimizes $C_{v,p}(L \mid \ell)$;
  5:     update $\ell$ so that $\ell(v) = L$;
  6: return $\ell$
  Line 4 is the step in which LocalSearch seeks to find an optimal set of labels for an object $v$ by solving Equation (3). This is also the place where our framework differentiates between the measures of Jaccard coefficient and set-intersection.
  B. Local step for Jaccard coefficient
  Problem 3 (JACCARD-TRIANGULATION): Consider the set $\{S_j, z_j\}_{j=1 \ldots n}$, where $S_j$ are subsets of a ground set $U$ = …
- 32. Jaccard triangulation: given $\{S_j, z_j\}_{j=1 \ldots n}$, find $X \subseteq U$ to minimize $d(X, \{S_j, z_j\}_{j=1 \ldots n}) = \sum_{j=1}^{n} |J(X, S_j) - z_j|$
- 33. set-intersection indicator: hit-n-miss sets; $O(\sqrt{n \log n})$ approximation; greedy approach
- 34. experimental evaluation
- 35. EMOTION: 593 objects, 6 labels; YEAST: 2417 objects, 14 labels
- 36. (figure: panels for EMOTION and YEAST plotting cost/edge, precision and recall against k and against p) Fig. 1. Cost per edge, precision and recall of OCC-JACC as a function of …
- 37. (figure: panels for EMOTION and YEAST plotting cost/edge, precision and recall against q) Fig. 3. Pruning experiment using OCC-JACC. Cost/edge and precision and recall as a function of the pruning threshold q. In fact, the results for YEAST shown in figures 1 and 2 were computed with q = 0.05. In terms of computational speedup …
- 38. protein clustering: pairwise similarities based on matching of amino-acid sequences; compare using a hand-made taxonomy
- 39. non-overlapping. TABLE II: Precision, recall, and their harmonic mean F-score, for non-overlapping clusterings of protein sequence datasets computed using SCPS [14] and the OCC algorithms. BL is the precision of a baseline that assigns all sequences to the same cluster.
  | dataset | BL prec | SCPS prec / recall / F-score | OCC-ISECT prec / recall / F-score | OCC-JACC prec / recall / F-score |
  |---|---|---|---|---|
  | D1 | 0.21 | 0.56 / 0.82 / 0.664 | 0.70 / 0.67 / 0.683 | 0.57 / 0.55 / 0.561 |
  | D2 | 0.17 | 0.59 / 0.89 / 0.708 | 0.86 / 0.83 / 0.844 | 0.64 / 0.63 / 0.637 |
  | D3 | 0.38 | 0.93 / 0.88 / 0.904 | 0.81 / 0.43 / 0.558 | 0.73 / 0.39 / 0.505 |
  | D4 | 0.14 | 0.30 / 0.64 / 0.408 | 0.64 / 0.56 / 0.598 | 0.44 / 0.39 / 0.412 |
  Summarizing: C3 and C4 contain elks and deer that stay away from cattle (C3 moving in higher X than C4); C1 also contains only elks and deer, but those move in the higher-Y area where the cattle also move; C2 is the cattle cluster and also contains a few elks and deer; finally C5 is another mixed cluster which overlaps with C2 only for the cattle and with …
- 40. overlapping. TABLE III: Comparing clusterings cost based on distance on the SCOP taxonomy, for different values of p, the maximum number of labels per protein.
  | dataset | SCPS | OCC-ISECT-p1 | OCC-ISECT-p2 | OCC-ISECT-p3 |
  |---|---|---|---|---|
  | D1 | 0.231 | 0.196 | 0.194 | 0.193 |
  | D2 | 0.188 | 0.112 | 0.107 | 0.106 |
  | D3 | 0.215 | 0.214 | 0.214 | 0.231 |
  | D4 | 0.289 | 0.139 | 0.133 | 0.139 |

  | dataset | SCPS | OCC-JACC-p1 | OCC-JACC-p2 | OCC-JACC-p3 |
  |---|---|---|---|---|
  | D1 | 0.231 | 0.208 | 0.202 | 0.205 |
  | D2 | 0.188 | 0.137 | 0.130 | 0.127 |
  | D3 | 0.215 | 0.243 | 0.242 | 0.221 |
  | D4 | 0.289 | 0.158 | 0.141 | 0.152 |
  … extremely close in the taxonomy, the error should have a small cost. Following this intuition we define the SCOP similarity between two proteins as follows: $\mathrm{sim}(u, v) = \frac{d(\mathrm{lca}(u, v))}{\max(d(u), d(v)) - 1}$ (8)
- 41. future work: scaling up; approximation algorithm; Jaccard triangulation; more experimentation and applications
- 42. thank you!
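The cost on slide 15 and the LocalSearch loop on slide 31 can be illustrated with a small sketch. It rests on several assumptions that are not from the talk: similarities are stored in a dict keyed by node pairs (missing pairs count as 0), the cost sums over unordered pairs, every node starts with the single label 0, and the local step simply enumerates every non-empty label set of size at most p. That exhaustive enumeration is swapped in for simplicity; the talk instead solves the local step with non-negative least squares (Jaccard) and a greedy method (set intersection). The names `pair_sim`, `occ_cost`, `vertex_cost` and `local_search` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two label sets (empty vs. empty assumed 1)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pair_sim(sim, u, v):
    """Look up s(u, v) in either key order; missing pairs are assumed 0."""
    return sim.get((u, v), sim.get((v, u), 0.0))


def occ_cost(sim, labels, H):
    """Sum over unordered pairs of |s(u, v) - H(l(u), l(v))|."""
    return sum(abs(pair_sim(sim, u, v) - H(labels[u], labels[v]))
               for u, v in combinations(list(labels), 2))


def vertex_cost(v, candidate, sim, labels, H):
    """Error incurred by v if it takes `candidate`, others kept fixed."""
    return sum(abs(pair_sim(sim, u, v) - H(candidate, labels[u]))
               for u in labels if u != v)


def local_search(sim, nodes, k, p, H=jaccard, max_rounds=100):
    # All non-empty label sets with at most p of the k labels (brute force).
    candidates = [frozenset(c) for r in range(1, p + 1)
                  for c in combinations(range(k), r)]
    labels = {v: frozenset([0]) for v in nodes}  # a valid initial labeling
    best = occ_cost(sim, labels, H)
    for _ in range(max_rounds):
        for v in nodes:
            # Local step: best label set for v given the current labeling.
            labels[v] = min(
                candidates,
                key=lambda L: vertex_cost(v, L, sim, labels, H))
        cost = occ_cost(sim, labels, H)
        if cost >= best:  # stop once the total cost no longer decreases
            break
        best = cost
    return labels, occ_cost(sim, labels, H)


# Toy input: b is half-similar to both a and c, which are dissimilar.
sim = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.0}
labels, cost = local_search(sim, ["a", "b", "c"], k=2, p=2)
```

On this toy input the search ends with `b` carrying both labels and `a` and `c` carrying one distinct label each, so every Jaccard value matches its similarity and the cost is 0. The enumeration grows quickly with k and p, so it is only suitable for small label sets.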
