Overlapping correlation clustering
Overlapping clustering, where a data point can be assigned to more than one cluster, is desirable in various applications, such as bioinformatics, information retrieval, and social network analysis. In this paper we generalize the framework of correlation clustering to deal with overlapping clusters. In short, we formulate an optimization problem in which each point in the dataset is mapped to a small set of labels, representing membership in different clusters. The number of labels does not have to be the same for all data points. The objective is to find a mapping so that the distances between points in the dataset agree as much as possible with distances taken over their sets of labels. For defining distances between sets of labels, we consider two measures: set-intersection indicator and the Jaccard coefficient.

To solve the problem we propose a local-search algorithm. Iterative improvement within our algorithm gives rise to non-trivial optimization problems, which, for the measures of set intersection and Jaccard, we solve using a greedy method and non-negative least squares, respectively.

Presentation Transcript

  • Overlapping correlation clustering. Francesco Bonchi, Aris Gionis, Antti Ukkonen. Yahoo! Research, Barcelona.
  • overlapping clusters are very natural: social networks, proteins, documents
  • most clustering algorithms produce disjoint partitions
  • overlapping is conceptually challenging to formulate: why assign a point to a further center? why/how to generate less good clusters?
  • correlation clustering: $C_{cc}(\ell) = \sum_{u,v} \left| s(u,v) - I(\ell(u) = \ell(v)) \right|$ (a cost computation is sketched below)
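To make the objective concrete, here is a minimal Python sketch of the correlation-clustering cost; `similarity` and `labels` are illustrative names, not from the paper:

```python
def cc_cost(similarity, labels):
    """Correlation-clustering cost C_cc: sum over pairs (u, v) of
    |s(u, v) - I(labels[u] == labels[v])|, where each object has
    exactly one cluster label.

    similarity: dict mapping pairs (u, v) to s(u, v) in [0, 1]
    labels:     dict mapping each object to a single cluster label
    """
    cost = 0.0
    for (u, v), s in similarity.items():
        indicator = 1.0 if labels[u] == labels[v] else 0.0
        cost += abs(s - indicator)
    return cost
```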
  • [Figure slides: a running example of pairwise similarities between objects, with values 1, 0, 0.33, 0.5, and 0.67; a value such as 0.33 cannot be matched exactly by a single-label assignment, motivating multiple labels = multi-cluster assignment; a final slide asks how to handle a similarity of 54167/108301.]
  • overlapping correlation clustering: $C_{occ}(\ell) = \sum_{u,v} \left| s(u,v) - H(\ell(u), \ell(v)) \right|$
  • comparing sets of labels via $H(\ell(u), \ell(v))$: the Jaccard coefficient, or the set-intersection indicator (both sketched below)
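A minimal sketch of the two measures and of the resulting cost, assuming label sets are plain Python sets; the convention of returning 1 for two empty sets is my assumption (the paper requires at least one label per object):

```python
def jaccard(A, B):
    """Jaccard coefficient |A intersect B| / |A union B| of two label sets."""
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def isect(A, B):
    """Set-intersection indicator: 1 if the label sets share a label."""
    return 1.0 if A & B else 0.0

def occ_cost(similarity, labels, H=jaccard):
    """Overlapping correlation-clustering cost C_occ: sum over pairs
    (u, v) of |s(u, v) - H(labels[u], labels[v])|."""
    return sum(abs(s - H(labels[u], labels[v]))
               for (u, v), s in similarity.items())
```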
  • correlation clustering vs. overlapping correlation clustering: both use a set of labels $L$ with $|L| = k$; correlation clustering assigns a single label $\ell(u) \in L$, whereas overlapping correlation clustering assigns a set $\ell(u) \subseteq L$; both minimize $C(\ell) = \sum_{u,v} |s(u,v) - H(\ell(u), \ell(v))|$, with the additional constraint $|\ell(u)| \le p$ in the overlapping case
  • dimensionality reduction: mapping to sets instead of vectors
  • [Figure: four objects $u$, $v$, $x$, $y$, highlighting the pairs $(u,v)$ and $(x,y)$.]
  • $C_{v,p}(\ell(v) \mid \ell) = \sum_{u \in V \setminus \{v\}} |s(u,v) - H(\ell(v), \ell(u))|$ expresses the error incurred by vertex $v$ when it has the labels $\ell(v)$, and the remaining nodes are labeled according to $\ell$. The subscript $p$ in $C_{v,p}$ serves to remind us that the set $\ell(v)$ should have at most $p$ labels. Our general local-search strategy is summarized in Algorithm 1 (a Python transcription is sketched below).

    Algorithm 1 LocalSearch
    1: initialize $\ell$ to a valid labeling;
    2: while $C_{occ}(V, \ell)$ decreases do
    3:   for each $v \in V$ do
    4:     find the label set $L$ that minimizes $C_{v,p}(L \mid \ell)$;
    5:     update $\ell$ so that $\ell(v) = L$;
    6: return $\ell$

    Line 4 is the step in which LocalSearch seeks to find an optimal set of labels for an object $v$ by solving Equation (3). This is also the place where our framework differentiates between the measures of Jaccard coefficient and set-intersection.

    B. Local step for Jaccard coefficient. Problem 3 (JACCARD-TRIANGULATION): Consider the set $\{\langle S_j, z_j \rangle\}$, where the $S_j$ are subsets of a ground set $U$ …
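A Python transcription of Algorithm 1, as a sketch: the local step of line 4 is here solved by brute force over all label sets of size at most `p`, which is feasible only for small `k`; the paper replaces this step with the specialized Jaccard and set-intersection solvers described next:

```python
from itertools import combinations

def local_cost(v, L, labels, similarity, H):
    """C_{v,p}(L | labels): error on the pairs involving v when v is
    relabeled with the set L and all other objects keep their labels."""
    cost = 0.0
    for (a, b), s in similarity.items():
        if a == v:
            cost += abs(s - H(L, labels[b]))
        elif b == v:
            cost += abs(s - H(labels[a], L))
    return cost

def local_search(objects, similarity, H, k, p, init_labels):
    """Algorithm 1 (LocalSearch), with a brute-force local step."""
    labels = dict(init_labels)              # line 1: a valid labeling
    candidates = [frozenset(c)              # all L with 1 <= |L| <= p
                  for r in range(1, p + 1)
                  for c in combinations(range(k), r)]
    improved = True
    while improved:                         # line 2: while cost decreases
        improved = False
        for v in objects:                   # line 3
            current = local_cost(v, labels[v], labels, similarity, H)
            best = min(candidates,          # line 4: best label set for v
                       key=lambda L: local_cost(v, L, labels, similarity, H))
            if local_cost(v, best, labels, similarity, H) < current:
                labels[v] = best            # line 5
                improved = True
    return labels                           # line 6
```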
  • Jaccard triangulation: given $\{\langle S_j, z_j \rangle\}_{j=1,\dots,n}$, find $X \subseteq U$ that minimizes $d(X, \{\langle S_j, z_j \rangle\}_{j=1,\dots,n}) = \sum_{j=1}^{n} |J(X, S_j) - z_j|$ (a brute-force illustration is sketched below)
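A brute-force illustration of the objective, viable only for a small ground set `U`; the paper's actual method relaxes the problem and solves it with non-negative least squares, which this sketch does not attempt:

```python
from itertools import combinations

def jaccard_triangulation_bruteforce(pairs, universe, p):
    """Try every non-empty X subset of U with |X| <= p and return the
    one minimizing sum_j |J(X, S_j) - z_j|.

    pairs:    list of (S_j, z_j) with S_j a set and z_j in [0, 1]
    universe: the ground set U of labels
    """
    def J(A, B):
        union = A | B
        return len(A & B) / len(union) if union else 1.0

    best_X, best_cost = None, float("inf")
    for r in range(1, p + 1):
        for X in map(frozenset, combinations(universe, r)):
            cost = sum(abs(J(X, S) - z) for S, z in pairs)
            if cost < best_cost:
                best_X, best_cost = X, cost
    return best_X, best_cost
```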
  • set-intersection indicator: hit-n-miss sets, an $O(\sqrt{n} \log n)$ approximation, a greedy approach (an illustrative sketch follows)
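The hit-n-miss construction itself is not reproduced here; purely as an illustration of a greedy local step under the set-intersection measure, one could add labels to a vertex one at a time while the error decreases. This simple heuristic is my own sketch, not the paper's algorithm:

```python
def greedy_isect_step(v, neighbors, candidate_labels, p):
    """Greedily pick a label set L (|L| <= p) for v under the
    set-intersection indicator: adding a label helps neighbors u with
    s(u, v) close to 1 whose label set it hits, and hurts neighbors
    with s(u, v) close to 0.

    neighbors: list of (label_set_of_u, s_uv) pairs
    """
    def err(L):
        return sum(abs(s - (1.0 if L & S else 0.0)) for S, s in neighbors)

    L = frozenset()
    while len(L) < p:
        best = min(candidate_labels, key=lambda c: err(L | {c}))
        if err(L | {best}) >= err(L) and L:
            break                      # no further label improves the error
        L = L | {best}                 # always keep at least one label
    return L
```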
  • experimental evaluation
  • datasets: EMOTION (593 objects, 6 labels) and YEAST (2417 objects, 14 labels)
  • [Fig. 1: cost per edge, precision, and recall of OCC-JACC on EMOTION and YEAST, as a function of $k$ (top row) and of $p$ (bottom row).]
  • [Fig. 3: pruning experiment using OCC-JACC; cost per edge, precision, and recall as a function of the pruning threshold $q$.] In fact, the results for YEAST shown in Figures 1 and 2 were computed with $q = 0.05$. In terms of computational speedup …
  • protein clustering: pairwise similarities based on matching of amino-acid sequences; compared using a hand-made taxonomy
  • TABLE II. Precision, recall, and their harmonic mean (F-score) for non-overlapping clusterings of protein sequence datasets computed using SCPS [14] and the OCC algorithms. BL is the precision of a baseline that assigns all sequences to the same cluster.

              BL      SCPS                   OCC-ISECT              OCC-JACC
    dataset   prec    prec / recall / F      prec / recall / F      prec / recall / F
    D1        0.21    0.56 / 0.82 / 0.664    0.70 / 0.67 / 0.683    0.57 / 0.55 / 0.561
    D2        0.17    0.59 / 0.89 / 0.708    0.86 / 0.83 / 0.844    0.64 / 0.63 / 0.637
    D3        0.38    0.93 / 0.88 / 0.904    0.81 / 0.43 / 0.558    0.73 / 0.39 / 0.505
    D4        0.14    0.30 / 0.64 / 0.408    0.64 / 0.56 / 0.598    0.44 / 0.39 / 0.412

    Summarizing: C3 and C4 contain elks and deer that stay away from the cattle (C3 moving at higher X than C4); C1 also contains only elks and deer, but those move in the higher-Y area where the cattle also move; C2 is the cattle cluster, and it also contains a few elks and deer; finally, C5 is another mixed cluster, which overlaps with C2 only for the cattle and with …
  • TABLE III. Comparing clustering cost based on distance on the SCOP taxonomy, for different values of $p$, the maximum number of labels per protein.

         SCPS     OCC-ISECT-p1   OCC-ISECT-p2   OCC-ISECT-p3
    D1   0.231    0.196          0.194          0.193
    D2   0.188    0.112          0.107          0.106
    D3   0.215    0.214          0.214          0.231
    D4   0.289    0.139          0.133          0.139

         SCPS     OCC-JACC-p1    OCC-JACC-p2    OCC-JACC-p3
    D1   0.231    0.208          0.202          0.205
    D2   0.188    0.137          0.130          0.127
    D3   0.215    0.243          0.242          0.221
    D4   0.289    0.158          0.141          0.152

    … extremely close in the taxonomy, the error should have a small cost. Following this intuition we define the SCOP similarity between two proteins as follows:

    $\mathrm{sim}(u, v) = \dfrac{d(\mathrm{lca}(u, v))}{\max(d(u), d(v)) - 1}$   (8)

    (A one-line reading of this formula is sketched below.)
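A one-line reading of Equation (8), assuming $d(\cdot)$ denotes depth in the SCOP taxonomy tree; the function and argument names are illustrative:

```python
def scop_similarity(depth_lca, depth_u, depth_v):
    """Equation (8): SCOP similarity of proteins u and v, where
    depth_lca = d(lca(u, v)) is the depth of their lowest common
    ancestor and depth_u, depth_v are the depths of the two leaves.
    Assumes max(depth_u, depth_v) > 1, i.e., leaves lie below the
    first level of the taxonomy."""
    return depth_lca / (max(depth_u, depth_v) - 1)
```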
  • future work: scaling up; approximation algorithm; Jaccard triangulation; more experimentation and applications
  • thank you!