Collaborative Similarity Measure for Intra-Graph Clustering
 

  • Many real-world graphs carry vertex attributes, including social networks, the World Wide Web, and sensor networks. Consider a coauthor network of the top 200 authors on technology-enhanced learning from DBLP, where a vertex represents an author and an edge represents a coauthor relationship between two authors. Each author has multiple attributes: ID, name, affiliation, research interests, number of coauthors, number of publications, and so on.
  • There are three main approaches: structure-based clustering, OLAP-style graph aggregation, and structural/attribute clustering. Structure-based clustering includes, for example, normalized cuts by Shi and Malik, modularity by Newman and Girvan, and Scan by Xu et al. It considers only structural similarity and ignores vertex attributes, so the clusters it generates have a rather random distribution of vertex properties within clusters. For the second approach, there is a recent study, K-SNAP by Tian et al., which follows attribute-compatible grouping. As a result, the clusters it generates have a rather loose intra-cluster structure.
  • In this paper, we study the problem of “An Intra-Graph Clustering Based on Collaborative Similarity Measure”. The objective is twofold. First, a desired clustering should achieve a good balance between two properties. One is structural cohesiveness: vertices within one cluster are close to each other in terms of structure, while vertices in different clusters are distant from each other. The other is attribute homogeneity: vertices within one cluster have similar attribute values, while vertices in different clusters have quite different attribute values. Second, the method should be scalable to medium (and large) scale graphs [in terms of time complexity, without compromising on the quality of the results].
  • For structure-based clustering, although vertices within clusters are closely connected, they could have quite different attribute values. For attribute-based clustering, although vertices within clusters have the same attribute values, much structure information may be lost. For structural/attribute clustering, vertices within clusters are both homogeneous and closely connected, and the graph keeps most of its structure information. Intra-Graph Clustering, on the other hand, considers both factors (structure and homogeneity) even for medium-scale graphs, and it takes less time than the state-of-the-art method.

Presentation Transcript

  • The 17th International Conference on Database Systems for Advanced Applications, Busan, South Korea. The 3rd International Workshop on Social Networks and Social Web Mining.
    Collaborative Similarity Measure for Intra-Graph Clustering
    Waqas Nawaz, Young-Koo Lee, Sungyoung Lee, Department of Computer Engineering, Kyung Hee University, Korea
    Thursday, April 19, 2012
    Presenter: Waqas Nawaz, Data and Knowledge Engineering (DKE) Lab, Kyung Hee University, Korea
  • Agenda
    • Motivation
    • Related Work
    • Proposed Method (CSM-IGC)
    • Experiments
    • Conclusion & Future Directions
    Data & Knowledge Engineering Lab
  • Graphs with Multiple Attributes
    [Figure: coauthor network of the top 200 authors on TEL from DBLP, with the attributes of authors; from manyeyes.alphaworks.ibm.com]
  • Related Work Structure based clustering  Normalized cuts [Shi and Malik, TPAMI 2000]  Modularity [Newman and Girvan, Phys. Rev. 2004]  Scan [Xu et al., KDD07] The clusters generated have a rather random distribution of vertex properties within clusters OLAP-style graph aggregation  K-SNAP [Tian et al., SIGMOD’08]  Attributes compatible grouping The clusters generated have a rather loose intra-cluster structure Data & Knowledge Engineering Lab 4
  • Example: A Coauthor Network
    [Figure: a coauthor graph of authors r1 to r11 labeled by research topic (r1, r2, r4, r5, r6, r7, r8: XML; r3: XML, Skyline; r9, r10, r11: Skyline). Panels: Traditional Coauthor Graph, Structure-based Cluster, Attribute-based Cluster, Structural/Attribute Cluster.]
    *https://wiki.engr.illinois.edu/download/attachments/186384385/VLDB09_notes.ppt
  • Related Work (cont…) Structure/Attribute based clustering  SA-Cluster [Yang Zhou et al., VLDB’ 2009] • Modify the structure of the original graph – add dummy vertex w.r.t each attribute instance – Sparse matrix and space inefficient • Neighborhood random walk: Matrix multiplication is performed iteratively • Fixed edge weights, and automatically update attribute weights Scalability issue for medium & large graphs (time complexity) Data & Knowledge Engineering Lab 6
  • Two-Fold Objective A desired clustering of attributed graph should achieve a good balance between the following:  Structural cohesiveness: Vertices within one cluster are close to each other in terms of structure, while vertices between clusters are distant from each other  Attribute homogeneity: Vertices within one cluster have similar attribute values, while vertices between clusters have quite different attribute values And it should be Scalable to medium scale graphs Data & Knowledge Engineering Lab 7
  • Different Graph Clustering Approaches
    • Structure-based clustering: vertices with heterogeneous attribute values end up in the same cluster
    • Attribute-based clustering: loses much structure information
    • Structural/attribute clustering: homogeneous vertices along with structure information, at the expense of time complexity
    • Intra-graph clustering: scalable while considering both aspects
  • Proposed Solution System Architecture Diagram INPUT Processing Phase OUTPUT Data & Knowledge Engineering Lab 9
  • Phase 1  Similarity Estimation (Inspired by Jaccard Index1)  Interaction of vertices (topology or structure) • Weighted fraction of shared neighbors • It will be zero for disconnected vertices • Example: Structural similarity among – SIM(V1, V2) = (1/3)*5 = 1.667 – SIM(V1, V3) = (1/4)*4 = 1.0 – SIM(V2, V3) = (1/4)*3 = 0.75 – V1 & V4 = (1/4)*0 = 0.0 • Transitive Property…! – SIM(V1, V4) = SIM(V1,V3) * SIM(V3,V4)1P. Jaccard, Etude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura., Soci`et`e Vaudoise des Sciences Naturelles, Vol.37, (1901) Data & Knowledge Engineering Lab 10
  • Transitive Property Data & Knowledge Engineering Lab 11
  • Phase 1 (cont…)  Similarity Estimation (Inspired by Jaccard Index1)  Context of vertices (attributes regularity) • Weighted fraction of shared attributes instances • It will be zero for contextually disjoint vertices • Example: Contextual similarity among – Lets Wa1 = 1 and Wa2 = 2 then – SIM(V1, V3) = (2/2) = 1.0 – SIM(V3, V4) = (1/2) = 0.5 – V1 & V4 = 0.01P. Jaccard, Etude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura., Soci`et`e Vaudoise des Sciences Naturelles, Vol.37, (1901) Data & Knowledge Engineering Lab 12
  • Collaborative Similarity Measure Structural Contextual Collaborative Measure Data & Knowledge Engineering Lab 13
  • Phase 2 Clustering (K-Medoid Approach) Data & Knowledge Engineering Lab 14
  • Algorithm DetailsSingle Pass Similarity Calculation Iterative Node Clustering Data & Knowledge Engineering Lab 15
  • Example
    Fig. 3. Scenarios for similarity between source (green) and destination (red) nodes following some intermediate nodes (yellow): (a) no direct path exists, (b) directly connected, (c) indirectly connected, shortest path.
    Table 2. (a) Collaborative similarity CSim(v_a, v_b) among the vertices given in Fig. 3-c, using the Collaborative Similarity Measure; (b) clustering results obtained by varying the number of clusters (K); the quality of each result is calculated using Density and Entropy.

    (a)
    vertex  V1    V2    V3    V4    V5    V6
    V1      1     2.67  1.17  0.20  0.18  0.18
    V2      2.67  1     0.92  0.15  0.14  0.14
    V3      1.17  0.92  1     0.17  0.15  0.15
    V4      0.2   0.15  0.17  1     0.92  0.92
    V5      0.18  0.14  0.15  0.92  1     2.5
    V6      0.18  0.14  0.15  0.92  2.5   1

    (b)
    K  Clustered Vertices           Density  Entropy
    2  {V1,V2,V3},{V4,V5,V6}        0.42     0.133
    3  {V1,V3},{V2},{V4,V5,V6}      0.28     0.084
    4  {V5},{V6},{V4},{V1,V2,V3}    0.21     0.084
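To make the iterative node clustering step concrete, here is a minimal k-medoid sketch in Python over the similarity values of Table 2(a). It is an illustration, not the authors' implementation: the initial medoids, the tie-breaking rule, and the rule that a medoid always stays in its own cluster are assumptions. That last rule matters here because some off-diagonal similarities exceed the diagonal value of 1.

```python
# Collaborative similarity values transcribed from Table 2(a); index 0 is V1.
CSIM = [
    [1.00, 2.67, 1.17, 0.20, 0.18, 0.18],
    [2.67, 1.00, 0.92, 0.15, 0.14, 0.14],
    [1.17, 0.92, 1.00, 0.17, 0.15, 0.15],
    [0.20, 0.15, 0.17, 1.00, 0.92, 0.92],
    [0.18, 0.14, 0.15, 0.92, 1.00, 2.50],
    [0.18, 0.14, 0.15, 0.92, 2.50, 1.00],
]


def k_medoids(sim, k, medoids, max_iter=100):
    """Similarity-based k-medoids with caller-supplied initial medoids."""
    medoids = list(medoids)
    assert len(medoids) == k
    clusters = []
    for _ in range(max_iter):
        # Assignment: each vertex joins its most similar medoid;
        # a medoid always stays in its own cluster (assumption).
        clusters = [[] for _ in range(k)]
        for v in range(len(sim)):
            if v in medoids:
                idx = medoids.index(v)
            else:
                idx = max(range(k), key=lambda c: sim[v][medoids[c]])
            clusters[idx].append(v)
        # Update: the new medoid is the member with the highest total
        # similarity to the other members (first member wins ties).
        new_medoids = [
            max(members, key=lambda m: sum(sim[m][o] for o in members if o != m))
            for members in clusters
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters, medoids


# Hypothetical starting medoids V1 and V4.
clusters, medoids = k_medoids(CSIM, 2, [0, 3])
print(clusters)  # [[0, 1, 2], [3, 4, 5]], i.e. {V1,V2,V3},{V4,V5,V6}
```

With these starting medoids the sketch converges to the K = 2 grouping listed in Table 2(b). Other initial medoids or values of K may give different groupings.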
  • Experiments Real Dataset  Political Blogs Dataset: 1490 vertices, 19090 edges, one attribute political leaning • Liberal • Conservative Methods  K-SNAP: Attributes only  S-Cluster: Structure-based clustering  W-Cluster: Weighted random walk strategy  SA-Cluster: Consider both factors (matrix manipulation)  IGC-CSM: Our proposed method Data & Knowledge Engineering Lab 17
  • Evaluation Metrics  Density*: intra-cluster structural cohesiveness  Entropy*: intra-cluster attribute homogeneity*Yang Zhou et al.,Graph Clustering Based on Structural/Attribute Similarities,Proceedings of VLDB Endowment,France (2009) Data & Knowledge Engineering Lab 18
  • Evaluation Metrics (cont…)
    • F-Measure*: evaluates the collective quality of the formed clusters
    *Tijn Witsenburg et al., Improving the Accuracy of Similarity Measures by Using Link Information, International Symposium on Methodologies for Intelligent Systems, Edition 9, Poland (2011)
  • Results (Time Complexity)
    • Synthetic dataset: graph size vs. time (varying number of nodes)
    • Real dataset: Political Blog*, number of clusters vs. time
    *http://www-personal.umich.edu/mejn/netdata
  • Results (Quality) Density Evaluation  Clusters vs. Density Value Entropy Evaluation  Clusters vs. Entropy Value Data & Knowledge Engineering Lab 21
  • Results (Quality) F-Measure Estimation  Clusters vs. F-measure Value Data & Knowledge Engineering Lab 22
  • Conclusion We study the problem of graph node clustering based on homogeneous characteristics in terms of context and topology  collaborative similarity measure to reflect the relational model among pair of vertices  k-Medoid clustering framework is adopted for grouping similar nodes The resulting solution is estimated using state of the art evaluation measures:  Density, Entropy, and F-measure Comparatively scalable to medium scale graphs without compromising on the quality of results Data & Knowledge Engineering Lab 23
  • Thanks. Any questions?
    wicky786@khu.ac.kr, yklee@khu.ac.kr, sylee@oslab.khu.ac.kr