SCAN++: Efficient Algorithm for Finding Clusters, Hubs and Outliers on Large-scale Graphs (VLDB 2015)

These are the talk slides from the VLDB 2015 conference, held in Kona, Hawaii.
For details, please see our paper:
http://www.vldb.org/pvldb/vol8/p1178-shiokawa.pdf

Published in: Data & Analytics

  1. SCAN++: Efficient Algorithm for Finding Clusters, Hubs, and Outliers on Large-scale Graphs
     Hiroaki Shiokawa†‡, Yasuhiro Fujiwara†, Makoto Onizuka§
     † Nippon Telegraph and Telephone Corporation, Japan
     ‡ University of Tsukuba, Japan
     § Osaka University, Japan
     VLDB 2015, Research Session #13
  2. Large-scale graphs are available
     • 500 million tweets/day (*2)
     • 320 million users/month (*2)
     • 1.49 billion users/month (*1)
     • Graph structure analysis: clusters, hubs, outliers
     • Uses: social analyses, scientific analyses, marketing, and so on
     *1 http://newsroom.fb.com/company-info/
     *2 https://about.twitter.com/ja/company
  3. Finding clusters, hubs and outliers
     • Structural clustering: SCAN [Xu+, 2007]
       – It identifies clusters, hubs, and outliers at the same time
       – It can overcome the resolution limit problem of modularity clustering
     • Cluster: a densely connected node set
     • Hub: a non-cluster node that bridges multiple clusters
     • Outlier: a node that belongs to no cluster and is not a hub
     [Figure: example graph with nodes 0 to 13, annotated with a cluster, a hub, and an outlier]
  4. Clusters Definition
     Cluster = Cores + Borders
     • Core: a node that has enough neighbors with dense connections
     • Structural similarity for density evaluations: $\sigma(u, v) = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}$, where $N(u)$ is the set of neighbor nodes of $u$
     • Border: a densely connected neighbor of a core
     By setting the density threshold 𝜖 and the minimum cluster size 𝜇, SCAN finds the clusters.
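
The structural similarity and the core test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy graph are hypothetical, and it assumes that N(u) includes u itself and that a core needs at least 𝜇 nodes (counting itself) whose similarity is at least 𝜖.

```python
from math import sqrt

def closed_neighborhood(adj, u):
    # Assumption: N(u) contains u itself in addition to its adjacent nodes.
    return adj[u] | {u}

def structural_similarity(adj, u, v):
    # |N(u) ∩ N(v)| / sqrt(|N(u)| * |N(v)|)
    nu, nv = closed_neighborhood(adj, u), closed_neighborhood(adj, v)
    return len(nu & nv) / sqrt(len(nu) * len(nv))

def is_core(adj, u, eps, mu):
    # Count the nodes in N(u) that are at least eps-similar to u.
    dense = [v for v in closed_neighborhood(adj, u)
             if structural_similarity(adj, u, v) >= eps]
    return len(dense) >= mu

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(structural_similarity(toy, 0, 1))  # nodes with identical neighborhoods
print(is_core(toy, 0, eps=0.7, mu=2))
```

Nodes 0 and 1 have identical closed neighborhoods, so their similarity is the maximum possible value of 1, while the pendant node 3 shares little with node 2 and scores lower.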
  5. Example (𝜖 = 0.7, 𝜇 = 2)
     • Node 4 is a core and nodes 0, 3, 5 are borders
     [Figure: example graph with nodes 0 to 6 and a similarity label of 0.67]
  6. Example (𝜖 = 0.7, 𝜇 = 2)
     • Node 4 is a core and nodes 0, 3, 5 are borders → extract a cluster
     [Figure: example graph with nodes 0 to 6; the extracted cluster is highlighted]
  7. Example (𝜖 = 0.7, 𝜇 = 2)
     • Node 4 is a core and nodes 0, 3, 5 are borders → extract a cluster
     • Node 5 is a core and nodes 0, 1, 2, 3 are borders → expand the cluster
     [Figure: example graph with nodes 0 to 6; the cluster before and after expansion]
  8. Example (𝜖 = 0.7, 𝜇 = 2)
     • Node 4 is a core and nodes 0, 3, 5 are borders → extract a cluster
     • Node 5 is a core and nodes 0, 1, 2, 3 are borders → expand the cluster
     • Shortcoming: although SCAN shows better clustering quality, it requires a large computation cost, ≈ $O(|\mathbb{E}|^2)$
     [Figure: example graph with nodes 0 to 6; the cluster before and after expansion]
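
The extract-then-expand procedure walked through in the example slides can be sketched as a breadth-first expansion from core nodes. This is an illustrative sketch of SCAN-style clustering under stated assumptions, not the authors' code: the names `sim`, `eps_neighborhood`, and `scan`, and the toy graph, are hypothetical, and N(u) is assumed to include u itself.

```python
from collections import deque
from math import sqrt

def sim(adj, u, v):
    # Structural similarity on closed neighborhoods (assumption: N(u) includes u).
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))

def eps_neighborhood(adj, u, eps):
    return {v for v in adj[u] | {u} if sim(adj, u, v) >= eps}

def scan(adj, eps, mu):
    label, next_id = {}, 0
    for s in sorted(adj):
        if s in label or len(eps_neighborhood(adj, s, eps)) < mu:
            continue  # already clustered, or not a core
        label[s] = next_id
        queue = deque([s])
        while queue:  # expand the cluster from every core reached
            u = queue.popleft()
            nbrs = eps_neighborhood(adj, u, eps)
            if len(nbrs) < mu:
                continue  # a border joins the cluster but does not expand it
            for v in nbrs:
                if v not in label:
                    label[v] = next_id
                    queue.append(v)
        next_id += 1
    hubs, outliers = set(), set()
    for u in adj:
        if u not in label:
            touched = {label[v] for v in adj[u] if v in label}
            (hubs if len(touched) >= 2 else outliers).add(u)
    return label, hubs, outliers

# Hypothetical toy graph: two 4-cliques, node 4 bridging them, node 9 dangling.
toy = {0: {1, 2, 3, 9}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5}, 5: {4, 6, 7, 8}, 6: {5, 7, 8}, 7: {5, 6, 8},
       8: {5, 6, 7}, 9: {0}}
labels, hubs, outliers = scan(toy, eps=0.7, mu=3)
print(labels, hubs, outliers)
```

On this toy graph the two cliques become two clusters, node 4 is reported as a hub because it touches both, and node 9 is an outlier. Every core re-evaluates the similarity to all of its neighbors, which is the cost that the shortcoming bullet refers to.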
  9. Challenge of this work
     Our challenge: overcome the runtime limitation without sacrificing quality!
  10. Contributions
     • Propose SCAN++: a structural clustering algorithm
     • Efficient
       – Considerably faster than SCAN
     • Exact
       – Provides exactly the same clustering results as SCAN
     • Effective
       – Improves its performance for real-world graphs that have a high clustering coefficient
  11. BRIEF SUMMARY OF SCAN++
  12. Our Basic Idea
     • Observation of real-world graphs
       – “If node u is two hops away from node v, their neighbors are likely to share a large portion of nodes”
     • This is supported by a well-known property [Watts+, 1998]: real-world graphs have a high clustering coefficient
     • In the illustrated example, node u shares 60% of its neighbors and node v shares 100% of its neighbors
     • The proposed algorithm SCAN++ therefore avoids the computations for the shared nodes
     [Figure: two nodes u and v, two hops apart, with their shared neighbors]
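
The observation on this slide can be checked on a small graph by measuring how many neighbors two nodes that are two hops apart have in common. The sketch below is only an illustration: the helper names and the toy graph are hypothetical, and the graph is chosen so that the percentages match those quoted on the slide.

```python
def two_hop_nodes(adj, u):
    # Nodes reachable in exactly two hops: neighbors of neighbors,
    # excluding u itself and its direct neighbors.
    reach = set()
    for w in adj[u]:
        reach |= adj[w]
    return reach - adj[u] - {u}

def shared_fraction(adj, u, v):
    # Fraction of u's neighbors that are also neighbors of v.
    return len(adj[u] & adj[v]) / len(adj[u])

# Hypothetical toy graph: u = 0 has five neighbors, v = 6 has three,
# and all three of v's neighbors are also neighbors of u.
toy = {0: {1, 2, 3, 4, 5}, 1: {0, 6}, 2: {0, 6}, 3: {0, 6},
       4: {0}, 5: {0}, 6: {1, 2, 3}}
u, v = 0, 6
print(two_hop_nodes(toy, u))
print(shared_fraction(toy, u, v), shared_fraction(toy, v, u))
```

Here u shares three of its five neighbors with v, and v shares all of its neighbors with u, so similarity work done around u covers every neighbor of v.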
  13. How SCAN++ works?
     • Two-phase clustering method
       – (Phase 1) Find local clusters from 2-hop away nodes
       – (Phase 2) Merge the local clusters
     [Figure: example graph with nodes 0 to 12]
     See paper for details
  14. How SCAN++ works?
     • Two-phase clustering method
       – (Phase 1) Find local clusters from 2-hop away nodes
       – (Phase 2) Merge the local clusters
     • Phase 1: Local clustering. Computes 2-hop away nodes and finds local clusters
     [Figure: example graph with nodes 0 to 12, before and after local clustering; local clusters are highlighted]
     See paper for details
  15. How SCAN++ works?
     • Two-phase clustering method
       – (Phase 1) Find local clusters from 2-hop away nodes
       – (Phase 2) Merge the local clusters
     • Phase 1: Local clustering. Computes 2-hop away nodes and finds local clusters
     • Phase 2: Cluster refinement. Merges local clusters if their 2-hop away nodes share a lot of nodes
     [Figure: example graph with nodes 0 to 12, shown initially, after Phase 1, and after Phase 2]
     See paper for details
  16. Other optimizations
     • Similarity sharing method
       – Reduces the similarity computation costs by using graph isomorphism
     • Parallel implementation
       – A parallel implementation of SCAN++ using MapReduce frameworks
     See paper for details
  17. Theoretical Analysis
     • [Theorem 1 – Time complexity] The time complexity of SCAN++ is $O\left(\frac{2-c}{2\delta+c}|\mathbb{E}|\right)$
       – $c$: strength of the clustering coefficients
       – $\delta$: inverse number of degree
       – $|\mathbb{E}|$: number of edges
     • [Theorem 2 – Exactness of SCAN++] SCAN++ always finds exactly the same clusters as SCAN
  18. EXPERIMENTAL RESULTS
  19. Experimental Settings
     • Datasets [table not preserved in the transcript]
     • Benchmark solutions
       – SCAN: Original method
       – SCAN*: Approximation method based on edge sampling
       – gSkeletonClu: Parameter-free extension of SCAN
     • Environment
       – Intel Xeon Processor 2.27 GHz with 144 GB RAM
       – All methods were implemented in C/C++
  20. Results 1 – Efficiency
     • SCAN++ wins on all datasets and is almost 20 times faster than SCAN
     [Figure: efficiency comparison chart]
  21. Results 2 – Exactness
     • ARI comparisons between each algorithm and SCAN
       – SCAN++ always returns ARI = 1 since it is theoretically guaranteed to find the same results as SCAN
     [Figure: ARI comparison chart]
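
ARI here is the adjusted Rand index, which scores the agreement between two clusterings of the same nodes and equals 1 when they are identical up to renaming the clusters. The sketch below shows one standard way to compute it from pair counts. It is not the evaluation code used in the talk; the function name and the sample labelings are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    # a and b are cluster labels for the same items, in the same order.
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both put everything in one cluster
    return (sum_ij - expected) / (max_index - expected)

reference = [0, 0, 1, 1]  # hypothetical reference clustering
same = [1, 1, 0, 0]       # same partition with renamed clusters
other = [0, 1, 0, 1]      # a different partition
print(adjusted_rand_index(reference, same))
print(adjusted_rand_index(reference, other))
```

Because the index depends only on which items are grouped together, an exact method such as SCAN++ scores 1 against SCAN regardless of how cluster IDs are assigned, while an approximate method can score lower.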
  22. Results 3 – Effectiveness (synthetic data)
     • LFR benchmark datasets [Lancichinetti+ 2009]
       – $|\mathbb{V}| = 100\text{K}$, $|\mathbb{E}| = 20\text{M}$
       – Vary the strength of clustering coefficients from 0.1 to 0.6
     [Figure: results chart]
  23. Results 4 – Effectiveness (synthetic data)
     • Caveman-model graphs (𝑐 ≈ 1) vs. balanced tree (𝑐 ≈ 0)
     [Figure: results for caveman-model graphs and for a balanced tree]
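
The two extremes compared on this slide can be reproduced by computing the average local clustering coefficient of each graph type. The sketch below is a simplified illustration with hypothetical helper names: it uses isolated cliques as a stand-in for a caveman-model graph and a balanced binary tree for the other extreme, which are assumptions rather than the generators used in the experiments.

```python
from itertools import combinations

def local_clustering(adj, u):
    # Fraction of pairs of u's neighbors that are themselves connected.
    k = len(adj[u])
    if k < 2:
        return 0.0
    links = sum(1 for v, w in combinations(adj[u], 2) if w in adj[v])
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    return sum(local_clustering(adj, u) for u in adj) / len(adj)

def cliques(num, size):
    # Simplified caveman-style graph: `num` isolated cliques of `size` nodes.
    adj = {}
    for c in range(num):
        members = set(range(c * size, (c + 1) * size))
        for u in members:
            adj[u] = members - {u}
    return adj

def balanced_binary_tree(depth):
    # Node i has children 2i + 1 and 2i + 2.
    n = 2 ** (depth + 1) - 1
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                adj[i].add(child)
                adj[child].add(i)
    return adj

print(average_clustering(cliques(num=5, size=4)))
print(average_clustering(balanced_binary_tree(depth=3)))
```

Every neighbor pair inside a clique is connected, so the coefficient is at its maximum, while a tree contains no triangles at all, so it is zero. These correspond to the 𝑐 ≈ 1 and 𝑐 ≈ 0 cases on the slide.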
  24. Summary
     • Goal of this work
       – Overcome the runtime limitations of SCAN without sacrificing clustering quality
     • Proposed SCAN++
       – We focused on the local clustering coefficient of graphs
       – Clustering method over two-hop away nodes
     • Contributions
       – SCAN++ runs almost 20 times faster than SCAN
       – SCAN++ finds exactly the same results as SCAN
       – SCAN++ effectively handles real-world properties