ICDE 2015 Ph.D. Symposium
Shortest Path Traversal Optimization
and Analysis for Large Graph
Clustering
Mr. Waqas Nawaz Khokhar
Department of Computer Engineering
Kyung Hee University, South Korea
Email: wicky786@khu.ac.kr
Advisor: Prof. Young-Koo Lee, Ph.D.
Monday, April 13th, 2015
Outline
■ Introduction
 Background
 Motivation and Problem Statement
■ Related Work
■ Proposed Methodology
 Collaborative Similarity Measure
 Shortest Path Overlapped Regions
 Confined and Parallel Graph Traversals
■ Experiments and Results
■ Contributions and Future Research
■ References
Graphs are Ubiquitous
■ Graph: set of vertices and edges
■ Graphs are very useful for modeling a variety of entities and their
inter-relationships
[Figure panels: Social networks; Protein Interactions; Internet; VLSI networks; Data dependencies; Neighborhood graphs]
Introduction Related Work
Proposed
Methodology
Experiments Conclusion
Graph Clustering
■ Graph Clustering
 Partition the vertices of a graph into disjoint sets such that each partition is a
well-connected/coherent group
■ Shortest Path
 A path from source to destination with the fewest edges (or, in a weighted graph, the least total edge weight)
■ Applications
 Discovery of protein complexes [Snel ’02, Kire ’14]
 Community discovery in social networks [Newman ’06]
 Image segmentation [Shi ’00]
 Politics [Valdis ’08]
 Web Advertisement [Derry ’08, Yao ’09]
 Computational Linguistics [Matsuo ’06, Ichioka ’08]
 Many more…
Many links within a
cluster, and fewer links
between clusters
Research Taxonomy: A Big Picture
[Figure: taxonomy tree rooted at Graphs. Node labels in reading order: Graph Mining, Pattern Mining, Node Clustering (Partition, Hierarchical), Searching, Graph Traversal (BFS, Shortest Path (Dijkstra, Bellman-Ford), DFS), Graph Application (Biological, Social, Community Detection, Analytics, Transportation). Highlighted nodes: Graph Partition, Dijkstra Algorithm, Community Detection, Analytics, connected by the edge labels “used” and “help compute”.]
Traditional Graph Clustering
■ Input: Attributed Graph
■ Process: Group Similar Vertices Together
■ Output: Clustered Graph
■ Graph clustering is an NP-complete/NP-hard problem [Ref]
[Ref] Survey on Graph Clustering, by Satu Elisa Schaeffer, Computer Science Review, 2007.
Challenges
■ Big Real World Graphs
 Graph encodes rich relationships [Ref1]
■ Non-trivial Memory Requirements
 1 million vertices → 3,725 GB
 1 billion vertices → $3.7 \times 10^9$ GB
• Assuming each node-pair similarity value costs 4 bytes to store
■ Graphs with Multiple Attributes
■ Bottleneck!!!
 $N^2$ computations for similarity
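The memory figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one 4-byte value is stored for every ordered vertex pair (N × N values) and that "GB" on the slide means 2^30 bytes; the function name `pairwise_storage_gib` is hypothetical.

```python
def pairwise_storage_gib(num_vertices, bytes_per_value=4):
    """Storage for one similarity value per ordered vertex pair, in units of 2**30 bytes."""
    total_bytes = num_vertices ** 2 * bytes_per_value
    return total_bytes / 2 ** 30


# 1 million vertices: 10^12 pairs * 4 bytes
print(round(pairwise_storage_gib(10 ** 6)))      # 3725
# 1 billion vertices: 10^18 pairs * 4 bytes
print(f"{pairwise_storage_gib(10 ** 9):.1e}")    # 3.7e+09
```

Storing only the upper triangle of a symmetric similarity matrix would roughly halve these numbers, which does not change the quadratic growth.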
[Ref1] Managing and Mining Billion-Node Graphs by Haixun Wang (Microsoft Research Asia) in KDD 2012 Summer School
[Ref2] manyeyes.alphaworks.ibm.com
Coauthor Network of Top 200 Authors on TEL from DBLP [Ref2]
Problem Statement
■ Optimized Graph Clustering: Given a huge graph $G$, efficiently find $K$
clusters $C = \{c_1, c_2, \ldots, c_i, \ldots, c_K\}$ such that the quality (in
terms of Density and Entropy) of the resulting clusters is
maximized
 $c_i = \{v_1, \ldots, v_j, \ldots, v_m\}$ is a set of $m$ vertices, where $v_j$ is a single vertex
 $G$ = an entire graph
 Density estimates the strong connectivity among nodes in each cluster
 The semantic resemblance among vertices is determined by Entropy
■ Informal: Optimizations and Analysis for Efficiency
 Can we solve the graph clustering problem with fewer pair-wise similarity
computations (i.e., $\ll N^2$)?
 Is there any redundant computation and, if so, how can it be avoided?
 Do we really need to visit the entire graph repeatedly?
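The two quality measures named in the problem statement can be made concrete with a small sketch. The slides do not give formulas, so the definitions below are assumptions borrowed from common usage in attributed-graph clustering: density is the fraction of edges whose endpoints fall in the same cluster, and entropy is the size-weighted Shannon entropy of a single attribute within each cluster (lower is better). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def density(edges, clusters):
    """Fraction of edges with both endpoints in the same cluster (assumed definition)."""
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    inside = sum(1 for u, v in edges if label[u] == label[v])
    return inside / len(edges)


def entropy(attr, clusters):
    """Size-weighted Shannon entropy of one attribute per cluster (assumed definition)."""
    n = sum(len(cluster) for cluster in clusters)
    total = 0.0
    for cluster in clusters:
        counts = Counter(attr[v] for v in cluster)
        h = -sum((k / len(cluster)) * math.log2(k / len(cluster))
                 for k in counts.values())
        total += (len(cluster) / n) * h
    return total


# Toy data: a 4-vertex path split into two clusters.
edges = [(1, 2), (2, 3), (3, 4)]
clusters = [{1, 2}, {3, 4}]
topics = {1: "XML", 2: "XML", 3: "Skyline", 4: "XML"}
print(density(edges, clusters))   # 2 of 3 edges are intra-cluster
print(entropy(topics, clusters))  # 0.5: one pure cluster, one evenly mixed
```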
Related Work
■ Scalability
 Parallel and Distributed Frameworks [1,2]
• Overhead of inter-process/machine communication
 Disk-based approaches for a single PC [3,4]
• Disk data management overhead: Extensive disk I/O
• Node searching is hard, so not suitable for clustering
 Relational Approach [5]
• Limited to fundamental graph operations
■ Complexity
 Restricted Neighborhood Information [6-9]
 Reduced Graph [10-13]
 Using Evolutionary Techniques [14,15]
 Sampling-based Heuristics [16,17]
 Limitation: Approximate / Non-exact Solutions
Proposed Methodology: At a Glance
■ Challenge 1: The time complexity of pair-wise similarity computation is high, i.e., $O(n^3)$
 Optimization 1, CSM: A collaborative similarity measure based on the shortest path strategy, $O(n^2 \log n)$
■ Challenge 2: Computations overlap where vertices/edges are visited repeatedly
 Optimization 2, SPORE: A novel concept is introduced to pre-compute and reuse necessary computations
■ Challenge 3: Unnecessary regions of the graph are traversed
 Optimization 3, Confined Traversal: The graph is physically partitioned cluster-wise to limit graph traversals
■ Challenge 4: Computing a set of shortest paths sequentially increases the overall latency
 Optimization 4, Set-based Approach: The set of shortest path queries is computed simultaneously
Optimization 1: Collaborative Similarity Measure (CSM)
■ Objective
 A good balance between Structure and Attribute
 It should scale to moderate-size graphs
■ Idea
• Define pair-wise similarity based on similar
neighborhood
• Use a single-path (shortest path) strategy
instead of all paths to reduce the complexity
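The idea above can be illustrated with a rough sketch. It is not the published CSM formula: the blend of neighborhood and attribute Jaccard similarity, the weight `alpha`, and the rule that the similarity of non-adjacent vertices is the product of edge similarities along the best path are all assumptions made for illustration. Under that product rule, maximizing similarity is the same as minimizing the sum of -log(similarity), so a single Dijkstra-style traversal per source is enough.

```python
import heapq
import math


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edge_similarity(adj, attrs, u, v, alpha=0.5):
    """Blend of structural and attribute similarity for adjacent u, v (assumed form)."""
    structural = jaccard(adj[u] | {u}, adj[v] | {v})  # similar closed neighborhoods
    attribute = jaccard(attrs[u], attrs[v])
    return alpha * structural + (1 - alpha) * attribute


def path_similarity(adj, attrs, source, alpha=0.5):
    """Best multiplicative similarity from source to every reachable vertex."""
    dist = {source: 0.0}          # accumulated -log(similarity)
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue              # stale heap entry
        for v in adj[u]:
            s = edge_similarity(adj, attrs, u, v, alpha)
            if s <= 0.0:
                continue
            nd = d - math.log(s)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return {v: math.exp(-d) for v, d in dist.items()}


# Hypothetical three-author chain with topic attributes.
adj = {"r1": {"r2"}, "r2": {"r1", "r3"}, "r3": {"r2"}}
attrs = {"r1": {"XML"}, "r2": {"XML"}, "r3": {"Skyline"}}
print(path_similarity(adj, attrs, "r1"))
```

Running one such traversal per vertex visits each edge once per source, which is where the gain over enumerating all paths comes from.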
[Figure: a coauthor graph of authors r1 to r11 labeled by topic (r1, r2, r4 to r8: XML; r3: XML and Skyline; r9 to r11: Skyline), shown in four panels: Traditional Coauthor graph, Structure-based Cluster, Attribute-based Cluster, Structural/Attribute Cluster]
[Figure: a shortest path between a Source and a Target vertex]
Images source: http://www.slideshare.net/ShiningStar786/presentation-on-graph-clustering-vldb-09
Optimization 2: Shortest Path Overlapped Region (SPORE)
■ Objective
 Prove the existence of shortest path overlaps
 Avoid redundant/overlapped traversals
■ Idea
 Identify and analyze the intersections among a set of shortest paths
 Keep shortest paths found during the current traversal as shortcuts for reuse
• Intuition: Sub-paths of shortest paths are also shortest paths [Ref]
• Assumption: Recently visited vertices are expected to be visited again
[Ref] Introduction to Algorithms, 2nd ed., (Cormen, Leiserson, Rivest, and Stein) 2001, p. 327.
[Figure panels: Original Graph G; P-Tree Rooted at A; SPOREs Extracted from S; Augmented Graph G’]
Example: the shortest paths $SP_{AE}$ = A C D E and $SP_{BF}$ = B C D F overlap on C D
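The overlap between the two example paths A C D E and B C D F can be found mechanically. The sketch below is only an illustration of the overlap idea, not the SPORE algorithm from [19]: it assumes paths are given as vertex lists, takes the longest contiguous shared run as the overlapped region, and stores it in a hypothetical `shortcuts` dictionary keyed by its end vertices.

```python
def overlapped_region(path_a, path_b):
    """Longest contiguous run of vertices that appears in both paths, in order."""
    best = []
    for i in range(len(path_a)):
        for j in range(len(path_b)):
            k = 0
            while (i + k < len(path_a) and j + k < len(path_b)
                   and path_a[i + k] == path_b[j + k]):
                k += 1
            if k > len(best):
                best = path_a[i:i + k]
    return best


sp_ae = ["A", "C", "D", "E"]
sp_bf = ["B", "C", "D", "F"]
region = overlapped_region(sp_ae, sp_bf)
print(region)  # ['C', 'D']

# Because sub-paths of shortest paths are shortest paths, the shared run can be
# cached once and reused by later traversals that reach its first vertex.
shortcuts = {}
if len(region) > 1:
    shortcuts[(region[0], region[-1])] = region
print(shortcuts)  # {('C', 'D'): ['C', 'D']}
```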
Optimization 3: Confined Traversals
■ Objective
 Avoid traversing unnecessary parts of the original graph
■ Idea
 Run single-source shortest path (SSSP) on the cluster's sub-graph, ignoring every edge that has at
least one endpoint in another cluster
■ Application: To update the centroids efficiently during clustering
 Compute the shortest paths from each vertex to all the other vertices in the
same cluster,
• i.e., $\mathrm{APSP}(cluster_i) = \bigcup_{j=1}^{|cluster_i|} \mathrm{SSSP}(v_j, cluster_i)$
[Figure: weighted Graph G at Iteration t with vertices A to I, and the same graph split into sub-graphs g1 and g2 once the edges between clusters are ignored]
SSSP(A, g1) ~ SSSP(A, G), where g1 ⊂ G
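A confined traversal can be sketched as Dijkstra's algorithm that simply skips any edge leading outside the current cluster. This is a minimal illustration under stated assumptions: the graph is an adjacency dictionary with non-negative weights, the source belongs to the cluster, and the small graph below is hypothetical (the edge weights of the slide's figure are not recoverable from the transcript). The names `confined_sssp` and `confined_apsp` are also hypothetical.

```python
import heapq


def confined_sssp(graph, source, cluster):
    """Dijkstra from source that never follows an edge leaving `cluster`."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in graph[u].items():
            if v not in cluster:
                continue                  # edge crosses into another cluster
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def confined_apsp(graph, cluster):
    """All-pairs distances inside one cluster: one confined SSSP per member."""
    return {v: confined_sssp(graph, v, cluster) for v in cluster}


graph = {
    "A": {"B": 3, "C": 1},
    "B": {"A": 3, "C": 1, "E": 2},
    "C": {"A": 1, "B": 1},
    "E": {"B": 2, "F": 4},
    "F": {"E": 4},
}
g1 = {"A", "B", "C"}
print(confined_sssp(graph, "A", g1))          # only A, B, C are reached
print(confined_sssp(graph, "A", set(graph)))  # the unconfined traversal also reaches E and F
```

The confined distances can differ from the full-graph distances when the true shortest path leaves the cluster, which is why the slide writes SSSP(A, g1) ~ SSSP(A, G) rather than an equality.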
Optimization 4: Parallelism
■ Objective
 Allow multiple SP traversal queries to be processed in parallel to reduce the
overall latency
■ Idea
 Manage the information for each traversal instance independently
• Keep the source vertex information along with its intermediate data
■ Open Challenge
 Explosion of intermediate path information for the set-based approach!!!
 It becomes even worse for certain types of graphs, e.g., power-law graphs
[Figure panels: Original Toy Graph; Element Approach; Set Approach]
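The set-based idea, and the intermediate-data growth it causes, can be seen in a small sketch. The code below is an assumption-laden illustration rather than the authors' implementation: it uses unweighted hop distances (BFS) instead of weighted shortest paths, advances all queries level by level in one loop, and tags every frontier entry with its source so that each traversal keeps independent state. The name `set_based_bfs` is hypothetical.

```python
def set_based_bfs(adj, sources):
    """Hop distances from every source, with all traversals expanded together."""
    dist = {s: {s: 0} for s in sources}      # independent state per source
    frontier = {(s, s) for s in sources}     # (source, vertex) pairs
    level = 0
    peak = len(frontier)                     # largest frontier seen so far
    while frontier:
        level += 1
        next_frontier = set()
        for s, u in frontier:
            for v in adj[u]:
                if v not in dist[s]:
                    dist[s][v] = level
                    next_frontier.add((s, v))
        frontier = next_frontier
        peak = max(peak, len(frontier))
    return dist, peak


# Hypothetical star graph: one hub connected to four leaves.
adj = {"hub": {"a", "b", "c", "d"},
       "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
dist, peak = set_based_bfs(adj, ["a", "b", "c", "d"])
print(dist["a"])
print(peak)  # 12: each of the 4 queries holds 3 leaves at the same time
```

The frontier holds one entry per (source, vertex) pair, so a high-degree hub multiplies the intermediate data by the number of concurrent queries, which matches the concern about power-law graphs.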
Environment for Experiments
■ Datasets
 Real Graphs from Stanford Library (1), Newman’s Network (2), Enron Email
Corpus (5), Political-blogs network (4), Synthetic (3)
■ Approaches for Comparison
 Collaborative Similarity Measure
• SA-Cluster, S-Cluster, W-Cluster, K-SNAP
 SPORE and Confined Traversal
• SegTable, Hybrid (SPORE+SegTable)
■ Evaluation Criteria
 Cluster Quality: Density, Entropy
1. Stanford Collection: http://snap.stanford.edu/data/
2. Scientific Collaboration Network: http://toreopsahl.com/datasets/newman2001/
3. Santo Fortunato's Graph Generator: http://santo.fortunato.googlepages.com/inthepress2/
4. http://www-personal.umich.edu/mejn/netdata
5. http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2
Optimization 1 Experimental Results: CSM
■ The proposed CSM [18] achieves competitive results efficiently
[Figure panels: Execution Time; Quality]
Optimization 2 Experimental Results: SPORE
■ SPORE [19] efficiently produces an order of magnitude fewer shortcuts as
pre-computed information
■ Pre-computed information is effective
[Figure panels: Efficiency and Effectiveness on the Collaboration, Facebook, and Email datasets; Generic graph processing model]
Optimization 3 & 4 Experimental Results: Confined Traversal
■ Confined Traversal [19] improves the execution time significantly (7 to 10 times)
■ We observe only marginal differences in SP distance values between the two strategies
■ Executing multiple SP queries simultaneously reduces the overall latency by at least 50%
[Figure panels: Time and Expansion Analysis; Effectiveness of the Restricted Traversal (sub) over Original Graph (full); Analyzing the Impact of Parallelism]
Contributions and Future Work
■ Provide Efficient Similarity Measure
 Improves the time complexity from $O(n^3)$ to $O(n^2 \log n)$, based on the shortest path
strategy
■ Suggest Graph Traversal Optimizations towards Clustering
 Avoid Redundant Computations: 40% speedup in time, an order of magnitude saving in space
 Redefine the Search Space: 10 times faster
 Computation Independence: Latency reduction 7~10 times
■ Future Research Directions
 Avoid the intermediate data explosion during parallel graph traversals
 Take into consideration the heterogeneity of large graphs
References
[1] Bryan Perozzi et al., “Scalable Graph Clustering with Pregel”, Complex Networks, 2013.
[2] CL Staudt et al., “Engineering High-Performance Community Detection Heuristics for Massive Graphs”, ICPP, 2013.
[3] P Sarkar et al., “Fast Nearest-neighbor Search in Disk-resident Graphs”, KDD, 2010.
[4] A Kyrola et al., “GraphChi: Large-Scale Graph Computation on Just a PC”, OSDI, 2012.
[5] J Gao et al., “Relational Approach for Shortest Path Discovery over Large Graphs”, VLDB, 2012.
[6] X Fu et al., “Threshold Random Walkers for Community Detection”, Journal of Software, 2013.
[7] X Qi et al., “Optimal local community detection in social networks based on density drop of subgraphs”, Pattern
Recognition Letters, 2014.
[8] D Delling et al., “Robust Exact Distance Queries on Massive Networks”, Report by Microsoft, 2014.
[9] DA Spielman et al., “A Local Clustering Algorithm for Massive Graphs and Its Application to Nearly Linear Time Graph
Partitioning”, SIAM J. on Computing, 2013.
[10] H Shiokawa et al., “Fast Algorithm for Modularity-Based Graph Clustering”, AAAI, 2013.
[11] J Feng et al., “Compression-based Graph Mining Exploiting Structure Primitives”, ICDM, 2013.
[12] JF Rodrigues et al., “Large Graph Analysis in the GMine System”, TKDE, 2013.
[13] Y Ruan et al., “Efficient community detection in large networks using content and links”, WWW, 2013.
[14] Y Yoon et al., “Vertex Ordering, Clustering, and Their Application to Graph Partitioning”, Applied Mathematics and Information Sciences, 2014.
[15] SR Mandala et al., “Clustering social networks using ant colony optimization”, Operational Research, 2011.
[16] B Yang et al., “Hierarchical community detection with applications to real-world network analysis”, DKE, 2013.
[17] I Rytsareva et al., “Scalable heuristics for clustering biological graphs”, IEEE Computational Advances in Bio and Medical
Sciences (ICCABS), 2013.
[18] Waqas Nawaz, Kifayat-Ullah Khan, Young-Koo Lee, and Sungyoung Lee, “Intra Graph Clustering using Collaborative Similarity Measure”, Journal of Distributed and Parallel Databases (SCIE, IF 1.0), 2015.
[19] Waqas Nawaz, Kifayat-Ullah Khan, and Young-Koo Lee, “SPORE: Shortest Path Overlapped Regions and Confined Traversals Towards Graph Clustering”, Applied Intelligence-APIN (SCI, IF 1.85), 2015.
Thank You!
Any Questions or Comments?

Editor's Notes

  • #4 Introduction Graph Traversals, Importance or usage, Types of Traversals (BFS, DFS, Shortest Path)
  • #5 References
    Matsuo et al. (2006) presented a graph clustering algorithm for word clustering based on word similarity measures by web counts.
    Ichioka and Fukumoto (2008) applied a similar approach to Matsuo et al. (2006) for Japanese onomatopoetic word clustering, and showed that the approach outperforms $k$-means clustering by 16.2%.
    Dorow and Widdows (2003) built a co-occurrence graph in which each node represents a noun and two nodes have an edge between them if they co-occur more than a given threshold. They then applied the Markov Clustering algorithm (MCL).
    Véronis (2004) proposed a graph-based model named HyperLex based on the small-world properties of co-occurrence graphs.
    Agirre (2007) proposed another method based on PageRank for finding hubs.
    Survey (Workshop 2010): Graph-based clustering for computational linguistics: a survey.
    Shortest-path-based methods: betweenness (Girvan and Newman, 2003), modularity, $O(|V||E|^2)$; information centrality (Fortunato et al., 2004), modularity, $O(|V||E|^3)$.
    (Derry’s MS Thesis 2008) Graphs, Clustering and Applications.
    Valdis’s work on politics, ref: http://www.orgnet.com/divided.html
    Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, and Eliot Gillum. 2009. BotGraph: large scale spamming botnet detection. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI'09). USENIX Association, Berkeley, CA, USA, 321-334.
    (PLOS ONE 2014) Exploring Function Prediction in Protein Interaction Networks via Clustering Methods, Kire Trivodaliev, Aleksandra Bogojeska, Ljupco Kocarev.
    (ICIBM 2013) Computational drug repositioning through Heterogeneous Network Clustering.
    Esuli A. and Sebastiani F., SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining, in Proceedings of LREC-06, 5th Conference on Language Resources and Evaluation, 2006.
    Esuli A. and Sebastiani F., PageRanking WordNet synsets: An application to opinion mining, in Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, CZ, 2007.
    Turney P.D., Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews, Proceedings of the 40th ACL, 2002.
    Ding X. and Liu B., The Utility of Linguistic Rules in Opinion Mining, SIGIR, 2007.
    (Thesis 2008) Graphs, Clustering and Applications.
    Nicolae and Nicolae (2006) proposed a new quality measure named BESTCUT.
    Chen and Ji (2009a) applied a normalized spectral algorithm to conduct event coreference resolution.
  • #12 Example source: https://wiki.engr.illinois.edu/download/attachments/186384385/VLDB09_notes.ppt In this paper, we study the problem of “An Intra-Graph Clustering Based on Collaborative Similarity Measure”. The objectives are twofold. First, a desired clustering should achieve a good balance between two properties: structural cohesiveness, which means vertices within one cluster are close to each other in terms of structure while vertices in different clusters are distant from each other; and attribute homogeneity, which means vertices within one cluster have similar attribute values while vertices in different clusters have quite different attribute values. Second, it should be scalable to medium (and large) scale graphs [in terms of time complexity without compromising on the quality of the results].
  • #16 Stanford Collection: http://snap.stanford.edu/data/ Scientific Collaboration Network: http://toreopsahl.com/datasets/newman2001/ Santo Fortunatos Graph Generator: http://santo.fortunato.googlepages.com/inthepress2/