ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph Clustering

ICDE 2015 Ph.D. Symposium
Shortest Path Traversal Optimization and Analysis for Large Graph Clustering
Mr. Waqas Nawaz Khokhar
Department of Computer Engineering
Kyung Hee University, South Korea
Email: wicky786@khu.ac.kr
Advisor: Prof. Young-Koo Lee, Ph.D.
Monday, April 13th, 2015

Outline
■ Introduction
 Background
 Motivation and Problem Statement
■ Related Work
■ Proposed Methodology
 Collaborative Similarity Measure
 Shortest Path Overlapped Regions
 Confined and Parallel Graph Traversals
■ Experiments and Results
■ Contributions and Future Research
■ References

Graphs are Ubiquitous
■ Graph: set of vertices and edges
■ Graphs are very useful for modeling a variety of entities and their
inter-relationships
■ Examples: social networks, protein interactions, the Internet, VLSI networks, data dependencies, neighborhood graphs

Graph Clustering
■ Graph Clustering
 Partition the vertices of a graph into disjoint sets such that each partition is a
well-connected/coherent group
■ Shortest Path
 A path from source to destination with the fewest edges (or the least total weight in a weighted graph)
■ Applications
 Discovery of protein complexes [Snel ’02, Kire ’14]
 Community discovery in social networks [Newman ’06]
 Image segmentation [Shi ’00]
 Politics [Valdis ’08]
 Web Advertisement [Derry ’08, Yao ’09]
 Computational Linguistics [Matsuo ’06, Ichioka ’08]
 Many more…
[Figure: many links within a cluster, and fewer links between clusters]
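A minimal sketch of the unweighted shortest-path definition on this slide, assuming the graph is stored as an adjacency-list dictionary. The function name `fewest_edge_path` and the toy graph are illustrative and do not come from the slides. Breadth-first search is used because it reaches every vertex along a path with the fewest edges.

```python
from collections import deque


def fewest_edge_path(adj, source, target):
    """Return one path with the fewest edges from source to target, or None."""
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk the parent pointers back to the source.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None                      # target is unreachable


# Hypothetical toy graph (undirected, stored in both directions).
toy = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
path = fewest_edge_path(toy, "A", "E")
print(path)
```

For weighted graphs the same role is played by Dijkstra's algorithm, which the later slides rely on.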

Research Taxonomy: A Big Picture
[Figure: taxonomy tree rooted at Graphs. Node labels: Graph Mining (Pattern Mining, Node Clustering, Partition, Hierarchical); Searching (Graph Traversal, BFS, Shortest Path, Dijkstra, Bellman-Ford, DFS); Graph Application (Biological, Social, Community Detection, Analytics, Transportation). Highlighted nodes: Graph Partition, Dijkstra Algorithm, Community Detection, Analytics. Arrow labels: used, help, compute.]

Traditional Graph Clustering
■ Input: Attributed Graph
■ Process: Group Similar Vertices Together
■ Output: Clustered Graph
■ Graph clustering is an NP-complete/NP-hard problem [Ref]
[Ref] Survey on Graph Clustering, by Satu Elisa Schaeffer, Computer Science Review, 2007.

Challenges
■ Big Real World Graphs
 Graph encodes rich relationships [Ref1]
■ Non-trivial Memory Requirements
 1 million vertices → 3725 GB
 1 billion vertices → $3.7 \times 10^9$ GB
• Assuming each node-pair similarity value costs 4 bytes to store
■ Graphs with Multiple Attributes
■ Bottleneck!!!
 $N^2$ computations for similarity
[Ref1] Managing and Mining Billion-Node Graphs by Haixun Wang (Microsoft Research Asia) in KDD 2012 Summer School
[Ref2] manyeyes.alphaworks.ibm.com
Coauthor Network of Top 200 Authors on TEL from DBLP [Ref2]
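The memory figures on this slide can be reproduced with a short calculation. The sketch assumes a dense N × N similarity matrix at 4 bytes per value (the cost stated on the slide) and reads GB as 2^30 bytes. That last reading is an assumption, chosen because it matches the 3725 GB figure. The function name is illustrative.

```python
BYTES_PER_SIMILARITY = 4  # storage cost per node pair, as stated on the slide


def similarity_matrix_gb(num_vertices, bytes_per_value=BYTES_PER_SIMILARITY):
    """Size of a dense N x N similarity matrix, in units of 2**30 bytes."""
    return num_vertices ** 2 * bytes_per_value / 2 ** 30


million = similarity_matrix_gb(10 ** 6)
billion = similarity_matrix_gb(10 ** 9)
print(f"1 million vertices: {million:,.0f} GB")
print(f"1 billion vertices: {billion:.1e} GB")
```

Storing only the upper triangle of a symmetric measure would halve these numbers, which does not change the conclusion.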

Problem Statement
■ Optimized Graph Clustering: Given a huge graph $G$, find $K$ clusters
$C = \{c_1, c_2, \dots, c_i, \dots, c_K\}$ efficiently such that the quality (in
terms of Density and Entropy) of the resulting clusters is maximized
 $c_i = \{v_1, \dots, v_j, \dots, v_m\}$ is a set of $m$ vertices, where $v_j$ is a single vertex
 $G$ is the entire graph
 Density estimates how strongly the nodes in each cluster are connected
 Entropy measures the semantic resemblance among vertices
■ Informal: Optimizations and Analysis for Efficiency
 Can we solve the graph clustering problem with fewer pair-wise similarity
computations (i.e., $\ll N^2$)?
 Is there any redundant computation and, if so, how can we avoid it?
 Do we really need to visit the entire graph repeatedly?
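The slides name Density and Entropy as the quality criteria but do not give formulas. The sketch below uses one common formulation as an assumption: density is the fraction of edges that stay inside a cluster, and entropy is the size-weighted Shannon entropy of a single attribute within each cluster. All function names and the toy data are illustrative.

```python
from collections import Counter
from math import log2


def density(edges, cluster_of):
    """Fraction of edges whose two endpoints fall in the same cluster."""
    if not edges:
        return 0.0
    inside = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    return inside / len(edges)


def entropy(attribute_of, cluster_of):
    """Size-weighted Shannon entropy of one attribute within each cluster."""
    members = {}
    for vertex, cluster in cluster_of.items():
        members.setdefault(cluster, []).append(vertex)
    total = len(cluster_of)
    result = 0.0
    for vertices in members.values():
        counts = Counter(attribute_of[v] for v in vertices)
        size = len(vertices)
        h = -sum((n / size) * log2(n / size) for n in counts.values())
        result += size / total * h
    return result


# Hypothetical five-author chain with one topic attribute per author.
edges = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r5")]
cluster_of = {"r1": 0, "r2": 0, "r3": 0, "r4": 1, "r5": 1}
topic = {"r1": "XML", "r2": "XML", "r3": "Skyline", "r4": "Skyline", "r5": "Skyline"}

d = density(edges, cluster_of)
e = entropy(topic, cluster_of)
print(d, round(e, 3))
```

Under this formulation a good clustering has high density and low entropy, so the two criteria pull in different directions when structure and attributes disagree.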

Related Work
■ Scalability
 Parallel and Distributed Frameworks [1,2]
• Overhead of inter-process/machine communication
 Disk based approach for single PC [3,4]
• Disk data management overhead: Extensive disk I/O
• Node searching is hard, so not suitable for clustering
 Relational Approach [5]
• Limited to fundamental graph operations
■ Complexity
 Restricted Neighborhood Information [6-9]
 Reduced Graph [10-13]
 Using Evolutionary Techniques [14,15]
 Sampling-based Heuristics [16,17]
 Limitation: Approximate / Non-exact Solutions

Proposed Methodology: At a Glance
■ Challenges
 1. The time complexity of pair-wise similarity computation is high, i.e., $O(n^3)$
 2. Computations overlap: vertices/edges are visited repeatedly
 3. Unnecessary regions of the graph are traversed
 4. Computing a set of shortest paths sequentially increases the overall latency
■ Optimizations
 1. CSM: a collaborative similarity measure based on a shortest-path strategy, $O(n^2 \log n)$
 2. SPORE: a novel concept introduced to pre-compute and reuse necessary computations
 3. Confined Traversal: the graph is physically partitioned cluster-wise to limit graph traversals
 4. Set-based Approach: the set of shortest-path queries is computed simultaneously

Optimization 1: Collaborative Similarity Measure (CSM)
■ Objective
 A good balance between Structure and Attribute
 It should be scalable to moderate-size graphs
■ Idea
• Define pair-wise similarity based on similar neighborhoods
• Use a single-path (shortest-path) strategy instead of all paths to reduce the complexity
[Figure: coauthor graph of authors r1 to r11 labeled by topic (r1, r2, r4 to r8: XML; r3: XML, Skyline; r9 to r11: Skyline). Panels: Traditional Coauthor graph, Structure-based Cluster, Attribute-based Cluster, Structural/Attribute Cluster. Inset: shortest path from Source to Target over vertices 1 to 6.]
Images source: http://www.slideshare.net/ShiningStar786/presentation-on-graph-clustering-vldb-09
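The two ideas on this slide can be sketched as follows. This is not the exact CSM formula from [18]. It is an illustration under stated assumptions: the similarity of two adjacent vertices blends the Jaccard overlap of their closed neighborhoods with the Jaccard overlap of their attribute sets (weight `alpha`), and the similarity of non-adjacent vertices is the best product of edge similarities along a single path, found with Dijkstra's algorithm on `-log(similarity)`. All names are hypothetical, and every vertex is assumed to have a non-empty attribute set.

```python
import heapq
from math import exp, log


def direct_similarity(adj, attrs, u, v, alpha=0.5):
    """Blend neighborhood overlap with attribute agreement for an edge (u, v)."""
    nu, nv = adj[u] | {u}, adj[v] | {v}            # closed neighborhoods
    structural = len(nu & nv) / len(nu | nv)
    contextual = len(attrs[u] & attrs[v]) / len(attrs[u] | attrs[v])
    return alpha * structural + (1 - alpha) * contextual


def path_similarity(adj, attrs, source, alpha=0.5):
    """Similarity from source to each reachable vertex along one best path."""
    cost = {source: 0.0}                           # cost = -log(similarity)
    heap = [(0.0, source)]
    done = set()
    while heap:
        c, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v in adj[u]:
            s = direct_similarity(adj, attrs, u, v, alpha)
            if s <= 0.0:
                continue                           # unusable edge
            nc = c - log(s)                        # s <= 1, so the step is >= 0
            if nc < cost.get(v, float("inf")):
                cost[v] = nc
                heapq.heappush(heap, (nc, v))
    return {v: exp(-c) for v, c in cost.items()}


# Hypothetical three-author chain r1 - r2 - r3.
adj = {"r1": {"r2"}, "r2": {"r1", "r3"}, "r3": {"r2"}}
attrs = {"r1": {"XML"}, "r2": {"XML"}, "r3": {"XML", "Skyline"}}
sim = path_similarity(adj, attrs, "r1")
print({v: round(s, 3) for v, s in sim.items()})
```

Running one such traversal per vertex costs n single-source searches instead of enumerating all paths, which is the kind of saving the single-path strategy is after.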

Optimization 2: Shortest Path Overlapped Region (SPORE)
■ Objective
 Prove the existence of shortest path overlaps
 Avoid redundant/overlapped traversals
■ Idea
 Identify and analyze the intersections among a set of shortest paths
 Maintain shortest paths (as shortcuts) from the current traversal for reuse
• Intuition: Sub-paths of shortest paths are also shortest paths [Ref]
• Assumption: Recently visited vertices are expected to be visited again
[Ref] Introduction to Algorithms, 2nd ed., (Cormen, Leiserson, Rivest, and Stein) 2001, p. 327.
[Figure: panels Original Graph G, SPOREs Extracted from S, P-Tree Rooted at A, Augmented Graph G’. Example: the shortest paths $SP_{AE}$ = A, C, D, E and $SP_{BF}$ = B, C, D, F overlap on C, D.]
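A small sketch of how an overlapped region could be detected between two shortest paths, using the A, C, D, E and B, C, D, F example from this slide. The function name and the rule that a region needs at least two vertices are assumptions for illustration. The SPORE extraction procedure in [19] may differ. Paths are compared in the same direction and are assumed to be simple (no repeated vertices).

```python
def overlapped_regions(path_a, path_b):
    """Return contiguous vertex runs (length >= 2) that appear in both paths."""
    regions = []
    i = 0
    while i < len(path_a):
        best = 0
        # Longest run starting at path_a[i] that also occurs in path_b.
        for j in range(len(path_b)):
            k = 0
            while (i + k < len(path_a) and j + k < len(path_b)
                   and path_a[i + k] == path_b[j + k]):
                k += 1
            best = max(best, k)
        if best >= 2:
            regions.append(path_a[i:i + best])
            i += best
        else:
            i += 1
    return regions


sp_ae = ["A", "C", "D", "E"]
sp_bf = ["B", "C", "D", "F"]
regions = overlapped_regions(sp_ae, sp_bf)
print(regions)
```

Because sub-paths of shortest paths are themselves shortest paths, each detected region can be stored once as a shortcut and reused by later traversals.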

Optimization 3: Confined Traversals
■ Objective
 Avoid traversing unnecessary parts of the original graph
■ Idea
 Perform SSSP on the sub-graph, ignoring every edge that has at least one
endpoint in another cluster
■ Application: To update the centroids efficiently during clustering
 Compute the shortest paths from each vertex to all the other vertices in the
same cluster,
• i.e., $\mathrm{APSP}(cluster_i) = \mathrm{SSSP}(v_j, cluster_i)$ where $j = 1, 2, \dots, |cluster_i|$
[Figure: weighted Graph G at Iteration t with vertices A to I, shown in full and split into the cluster sub-graphs g1 and g2.]
SSSP(A, g1) ~ SSSP(A, G), where g1 ⊂ G
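A minimal sketch of a confined traversal, assuming a weighted adjacency list and a set of vertices for the cluster. It is a standard Dijkstra search that skips any edge leaving the cluster. The function names and the toy graph are hypothetical (the edge weights of the figure cannot be recovered from the transcript), and the source is assumed to belong to the cluster.

```python
import heapq


def confined_sssp(adj, source, cluster):
    """Dijkstra from source that only relaxes edges staying inside cluster."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in adj[u]:
            if v not in cluster:
                continue                  # edge crosses into another cluster
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def confined_apsp(adj, cluster):
    """One confined SSSP per cluster member, as in the APSP expression above."""
    return {v: confined_sssp(adj, v, cluster) for v in cluster}


# Hypothetical graph: g1 = {A, B, C}, the remaining vertices form g2.
adj = {
    "A": [("B", 1), ("C", 4), ("E", 1)],
    "B": [("A", 1), ("C", 2)],
    "C": [("A", 4), ("B", 2), ("E", 1)],
    "E": [("A", 1), ("C", 1), ("F", 2)],
    "F": [("E", 2)],
}
g1 = {"A", "B", "C"}
confined = confined_sssp(adj, "A", g1)
full = confined_sssp(adj, "A", set(adj))
print(confined, full)
```

In this toy graph the confined distance from A to C is 3 while the full-graph distance is 2 (through E), which shows why the slide writes SSSP(A, g1) ~ SSSP(A, G) rather than strict equality.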

Optimization 4: Parallelism
■ Objective
 Allow multiple SP traversal queries to be processed in parallel to reduce the
overall latency
■ Idea
 Manage the information for each traversal instance independently
• Keep the source vertex information along with its intermediate data
■ Open Challenge
 Explosion of intermediate path information for the set-based approach!
 It becomes even worse for certain types of graphs, e.g., power-law graphs
[Figure panels: Original Toy Graph; Element Approach; Set Approach]
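One possible reading of the set-based idea, sketched for unweighted graphs: all traversals advance together, and each frontier vertex carries the set of sources whose wave is currently there, while distances are kept per source. This interpretation, the function name, and the toy graph are assumptions. The slides do not specify the data layout.

```python
def set_based_bfs(adj, sources):
    """Hop distances from every source, advancing all traversals level by level."""
    dist = {s: {s: 0} for s in sources}        # independent state per traversal
    frontier = {}                              # vertex -> sources whose wave is here
    for s in sources:
        frontier.setdefault(s, set()).add(s)
    level = 0
    while frontier:
        level += 1
        next_frontier = {}
        for u, active in frontier.items():
            for v in adj[u]:
                # Sources that reach v for the first time at this level.
                fresh = {s for s in active if v not in dist[s]}
                for s in fresh:
                    dist[s][v] = level
                if fresh:
                    next_frontier.setdefault(v, set()).update(fresh)
        frontier = next_frontier
    return dist


# Hypothetical path graph 1 - 2 - 3 - 4 with two simultaneous queries.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
dist = set_based_bfs(adj, [1, 4])
print(dist)
```

The per-vertex source sets are the intermediate data that can explode: with many sources and high-degree hubs, as in power-law graphs, a single frontier vertex may carry almost every source at once.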

Environment for Experiments
■ Datasets
 Real graphs from the Stanford Collection¹, Newman’s network², the Enron email
corpus⁵, and the political-blogs network⁴; synthetic graphs³
■ Approaches for Comparison
 Collaborative Similarity Measure
• SA-Cluster, S-Cluster, W-Cluster, K-SNAP
 SPORE and Confined Traversal
• SegTable, Hybrid (SPORE+SegTable)
■ Evaluation Criteria
 Cluster Quality: Density, Entropy
1. Stanford Collection: http://snap.stanford.edu/data/
2. Scientific Collaboration Network: http://toreopsahl.com/datasets/newman2001/
3. Santo Fortunato’s Graph Generator: http://santo.fortunato.googlepages.com/inthepress2/
4. http://www-personal.umich.edu/mejn/netdata
5. http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2

Optimization 1 Experimental Results: CSM
■ The proposed CSM [18] achieves competitive results efficiently
[Figure panels: Execution Time; Quality]

Optimization 2 Experimental Results: SPORE
■ SPORE [19] efficiently produces an order of magnitude fewer shortcuts as
pre-computed information
■ Pre-computed information is effective
[Figure panels: Efficiency; Effectiveness. Datasets: Collaboration, Facebook, Email. Inset: Generic graph processing model.]

Optimization 3 & 4 Experimental Results: Confined Traversal
■ Confined Traversal [19] improves the execution time significantly (7 to 10 times)
■ We observe only marginal differences in SP distance values between the two strategies
■ Executing multiple SP queries simultaneously reduces the overall latency by at least 50%
[Figures: Time and Expansion Analysis; Effectiveness of the Restricted Traversal (sub) over Original Graph (full); Analyzing the Impact of Parallelism]

Contributions and Future Work
■ Provide Efficient Similarity Measure
 Improves the time complexity, $O(n^3) \to O(n^2 \log n)$, based on shortest path
strategy
■ Suggest Graph Traversal Optimizations towards Clustering
 Avoid Redundant Computations: Time 40% speedup, Space an order of
magnitude
 Redefine the Search Space: 10 times faster
 Computation Independence: Latency reduction 7~10 times
■ Future Research Directions
 Avoid the intermediate data explosion during parallel graph traversals
 Take into consideration the heterogeneity of large graphs

References
[1] Bryan Perozzi et al., “Scalable Graph Clustering with Pregel”, Complex Networks, 2013.
[2] CL Staudt et al., “Engineering High-Performance Community Detection Heuristics for Massive Graphs”, ICPP, 2013.
[3] P Sarkar et al., “Fast Nearest-neighbor Search in Disk-resident Graphs”, KDD, 2010.
[4] A Kyrola et al., “GraphChi: Large-Scale Graph Computation on Just a PC”, OSDI, 2012.
[5] J Gao et al., “Relational Approach for Shortest Path Discovery over Large Graphs”, VLDB, 2012.
[6] X Fu et al., “Threshold Random Walkers for Community Detection”, Journal of Software, 2013.
[7] X Qi et al., “Optimal local community detection in social networks based on density drop of subgraphs”, Pattern Recognition Letters, 2014.
[8] D Delling et al., “Robust Exact Distance Queries on Massive Networks”, Report by Microsoft, 2014.
[9] DA Spielman et al., “A Local Clustering Algorithm for Massive Graphs and Its Application to Nearly Linear Time Graph Partitioning”, SIAM J. on Computing, 2013.
[10] H Shiokawa et al., “Fast Algorithm for Modularity-Based Graph Clustering”, AAAI, 2013.
[11] J Feng et al., “Compression-based Graph Mining Exploiting Structure Primitives”, ICDM, 2013.
[12] JF Rodrigues et al., “Large Graph Analysis in the GMine System”, TKDE, 2013.
[13] Y Ruan et al., “Efficient community detection in large networks using content and links”, WWW, 2013.
[14] Y Yoon et al., “Vertex Ordering, Clustering, and Their Application to Graph Partitioning”, Applied Mathematics and Information Sciences, 2014.
[15] SR Mandala et al., “Clustering social networks using ant colony optimization”, Operational Research, 2011.
[16] B Yang et al., “Hierarchical community detection with applications to real-world network analysis”, DKE, 2013.
[17] I Rytsareva et al., “Scalable heuristics for clustering biological graphs”, IEEE Computational Advances in Bio and Medical Sciences (ICCABS), 2013.
[18] Waqas Nawaz, Kifayat-Ullah Khan, Young-Koo Lee, and Sungyoung Lee, “Intra Graph Clustering using Collaborative Similarity Measure”, Journal of Distributed and Parallel Databases (SCIE, IF 1.0), 2015.
[19] Waqas Nawaz, Kifayat-Ullah Khan, and Young-Koo Lee, “SPORE: Shortest Path Overlapped Regions and Confined Traversals Towards Graph Clustering”, Applied Intelligence-APIN (SCI, IF 1.85), 2015.

Thank You!
Any Questions or Comments?
