- 1. Joint Work with: Jen Neville Ryan Rossi Nick Duffield
- 2. - - - - - Social network Human Disease Network [Barabasi 2007] Food Web [2007] Terrorist Network [Krebs 2002]Internet (AS) [2005] Gene Regulatory Network [Decourty 2008] Protein Interactions [breast cancer] Political blogs Power grid
- 3. (1) Motif Counting Theory and Algorithms for Large Graphs (2) Machine Learning Applications for Motif Counting
- 4. § Network motifs (graphlets) are small induced subgraphs § Motifs represent patterns of inter-connections, basic elements in complex networks CliqueTriangle CycleEdge
- 5. A study of motif frequencies in real-world complex networks Applied to food web, genetic, neural, web, and other network - Found distinct motifs in each case
- 6. Red dashed lines indicate edges that participate in a feed-forward loop Motifs are over-expressed in real complex networks
- 7. Motifs occur in real networks with frequencies significantly higher than randomly generated networks Real networks often have modular structure Cj# Ci# Ck# Concentration of Motifs in real vs random networks Two Conclusions
- 8. Nodes/edges are not the fundamental units of real networks Motifs are the building blocks of these networks We should analyze/model graphs in terms of motifs
- 9. Network Motifs: Simple Building Blocks of Complex Networks – [Milo et. al – Science 2002] The Structure and Function of Complex Networks – [Newman – Siam Review 2003] 2-node Graphlets 3-node Graphlets 4-node Graphlets Connected Disconnected Motifs/graphlets are Small k-vertex induced subgraphs
- 11. Ex: Given an input graph G - How many triangles in G? - How many cliques of size 4-nodes in G? - How many cycles of size 4-nodes in G?
- 12. Ex: Given an input graph G - How many triangles in G? - How many cliques of size 4-nodes in G? - How many cycles of size 4-nodes in G? à In practice, we would like to count all k-vertex graphlets
- 13. § Enumerate all possible graphlets
- 14. § Enumerate all possible graphlets à Exhaustive enumeration is too expensive
- 15. § Enumerate all possible graphlets à Exhaustive enumeration is too expensive § Count graphlets for each node – and combine all node counts [Shervashidze et. al – AISTAT 2009]
- 16. § Enumerate all possible graphlets à Exhaustive enumeration is too expensive § Count graphlets for each node – and combine all node counts à Still expensive for relatively large k [Shervashidze et. al – AISTAT 2009]
- 17. § Enumerate all possible graphlets à Exhaustive enumeration is too expensive § Count graphlets for each node – and combine all node counts à Still expensive for relatively large k [Shervashidze et. al – AISTAT 2009] § Other recent work counts only connected graphlets of size k=4 [Marcus & Shavitt – Computer Networks 2012] Not practical – scales only for small graphs with few hundred/thousand nodes/edges - taking hours for a graph with 30K nodes
- 18. Most work focused on graphlets of k=3 nodes In this work, we focus on graphlets of k=3,4 nodes Efficient Graphlet Counting for Large Networks [Ahmed et al., ICDM 2015] Graphlet Decomposition: Framework, Algorithms, and Applications [Ahmed et al., KAIS Journal 2016]
- 19. Edge-centric, Parallel, Fast, Space-efficient Framework u v v2 v3v1 v4 v6 v7 edge Edge Neighborhood Jointly count motifs
- 20. Searching Edge Neighborhoods ① For each edge do u v v2 v3v1 v4 v6 v7 edge
- 21. Searching Edge Neighborhoods ① For each edge do • Count All 3-node graphlets ② Merge counts from all edges u v v2 v3v1 v4 v6 v7 edge Triangle 2-star 1-edge Independent
- 22. Searching Edge Neighborhoods ① For each edge do • Count All 3-node graphlets ② Merge counts from all edges u v v2 v3v1 V4 v6 v7 edge Triangle 2-star 1-edge Independent q We only need to find/count triangles q Use equations to get counts of others in o(1) Triangle
- 23. How to count all 4-node graphlets? 4-Clique 4-Cycle4-Chrodal-Cycle Tailed-triangle 4-Path 3-Star 4-node-triangle 4-node-2star 4-node-2edge 4-node-1edge Independent
- 24. ± 1 edge 4-Node Motif Transition Diagram
- 25. Step 1 Step 2 Step 3 Searching Edge Neighborhoods For each edge Find the triangles Count 4-node graphlets For each edge Count 4-node cliques and 4-node cycles only Count 4-node graphlets For each edge Use combinatorial relationships to compute counts of other graphlets in constant time Step 4 Merge counts from all edges
- 26. ± 1 edge 4-Node Motif Transition Diagram
- 27. 4-Node Graphlet Transition Diagram ± 1 edge 4-Cliques 4-Cycles
- 28. ± 1 edge 4-Node Motif Transition Diagram Count a few patterns Use relationships & transitions to count all other graphlets in constant time ✓ Fast ✓ Space-Efficient
- 29. 4-Node Motif Transition Diagram ± 1 edge 4-Cliques 4-Cycles Maximum no. triangles Incident to an edge Maximum no. stars Incident to an edge Count a few patterns Use relationships & transitions to count all other graphlets in constant time
- 30. T T Relationship between 4-cliques & 4-ChordalCycles 4-Cliques 4-ChordalCycle e T T e No. 4-ChordalCycles No. 4-Cliques Proof in Lemma 1 - Ahmed et al., ICDM 2015
- 31. T T Relationship between 4-cliques & 4-ChordalCycles T T No. 4-ChordalCycles No. 4-Cliques 4-Cliques 4-ChordalCycle e e Proof in Lemma 1 - Ahmed et al., ICDM 2015
- 32. 0 1 2 4 8 16 0 5 10 15 Number of Processing Units Speedup socfb−Texas socfb−OR socfb−UCLA socfb−Berkeley13 socfb−MIT socfb−Penn94 0 1 2 4 8 16 0 5 10 15 Number of Processing Units Speedup 0 1 2 4 8 16 0 2 4 6 8 10 12 14 Number of Processing Units Speedup tech−internet−as tech−WHOIS web−it−2004 web−spam 0 1 2 4 8 16 0 2 4 6 8 10 12 14 Number of Processing Units Speedup Strong scaling results Intel Xeon 3.10 Ghz E5-2687W server v3, 16 cores
- 33. Comparison to RAGE [Marcus & Shavitt – J. Computer Networks 2011] Facebook100 Networks from US Schools Ours RAGE Time in Seconds
- 34. |V| |E| Ours RAGE Time in Seconds Baseline (RAGE) did not finish for most graphs We take ~4.5 secs for web-google (4.3M edges) We take ~4 secs for inf-road-usa (29M edges) Most motif counts in orders of 106 – 1015
- 35. § Node-level and edge-level motif counting Role discovery, Relational Learning, Multi-label Classification
- 36. 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds Key Observations The distribution of graphlet runtimes for edge neighborhoods obey a power-law.
- 37. 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds Most edge neighborhoods are fast with runtimes that are approximately equal. Key Observations The distribution of graphlet runtimes for edge neighborhoods obey a power-law.
- 38. 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds HOWEVER, a handful of neighborhoods are hard and take significantly longer. Most edge neighborhoods are fast with runtimes that are approximately equal. Key Observations The distribution of graphlet runtimes for edge neighborhoods obey a power-law. Neighborhood runtimes are power-lawed
- 39. 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds HOWEVER, a handful of neighborhoods are hard and take significantly longer. Most edge neighborhoods are fast with runtimes that are approximately equal. Key Observations The distribution of graphlet runtimes for edge neighborhoods obey a power-law. Neighborhood runtimes are power-lawed QUESTION: How to reduce runtime? à Sampling
- 40. § [Horvitz-Thompson Estimation 1952] § Using HT estimator we scale the sampled count by probability p Count of graphlet i for edge j Estimated Count of graphlet i for edge j More details in the paper See Lemma 1
- 41. § [Hoeffding’s Inequality 1963] More details in the paper See Lemma 2
- 42. Sampled maximum degree in the graph Less than the actual maximum degree
- 43. Average relative error for top 1000 (highest degree) edges Sampling with replacement p = 0.01 GFD: Graphlet Frequency Distribution Kolmogorov distance (KS) L1 Normalized distance
- 44. 1 2 4 8 12 16 0 2 4 6 8 10 12 14 16 Number of processing units Speedup socfb−MIT bio−dmela soc−gowalla tech−RL−caida web−wikipedia09 1 2 4 8 12 16 0 2 4 6 8 10 12 14 16 Number of processing units Speedup Strong scaling results Using Intel Xeon E5-2687W v3 server, 16 cores
- 45. § Unbiased Estimation of Motif Counts 10 4 10 5 0.85 0.9 0.95 1 1.05 1.1 1.15 soc−orkut−dir Sample Size 10 4 10 5 0.85 0.9 0.95 1 1.05 1.1 1.15 soc−orkut−dir Sample Size x/y 10 4 10 5 0.9 0.95 1 1.05 1.1 1.15 soc−flickr Sample Size 10 4 10 5 0.9 0.95 1 1.05 1.1 1.15 soc−flickr Sample Size Estimation of counts of 4-vertex clique [Ahmed et. al, ICDM 2015, KAIS 2016, TNNLS 2018]
- 46. Counting a few motifs + Storing a few motif counts + Obtain other counts using combinatorial relationships (1) Fast and parallel (2) Space-efficient Two Conclusions Joint Motif Counting for Large Graphs
- 47. ✓ (1) Motif Counting Theory and Algorithms for Large Graphs (2) Machine Learning Applications for Motif Counting
- 48. § Biological Networks • network alignment, protein function prediction [Pržulj 2007][Milenković-Pržulj 2008] [Hulovatyy-Solava-Milenković 2014] [Shervashidze et al. 2009][Vishwanathan et al. 2010] § Social Networks • Triad analysis, role discovery, community detection [Granovetter 1983][Holland-Leinhardt 1976][Rossi-Ahmed 2015] [Ahmed et al. 2015][Xie-Kelley-Szymanski 2013] § Internet AS [Feldman et al. 2008] § Spam Detection [Becchetti et al. 2008][Ahmed et al. 2016] - - - - - Useful for various machine learning tasks e.g., Anomaly detection, Role Discovery, Relational Learning, Clustering etc.
- 49. § Higher-order network analysis § Graph classification § Higher-order Network Embeddings
- 50. Higher-order network structures § Visualization – “spotting anomalies” [Ahmed et al. ICDM 2015] § Finding large cliques, stars, and other larger network structures [Ahmed et al. KAIS 2015] § Spectral clustering [Benson et al. Science 2016] § Role discovery [Ahmed et al. 2016] ...
- 51. Ranking by graphlet counts Nodes are colored/weighted by triangle counts Links are colored/weighted by stars of size 4 nodes Leukemia Colon cancer Deafness
- 52. § Higher-order network analysis § Graph classification § Higher-order Network Embeddings
- 53. Label 1 Label 0 Enzyme Non-Enzyme Collection of Graphs (e.g. Protein Graphs) . . . Graphs Each Protein is represented by a graph Binary label represents the function of the protein
- 54. Label 1 Label 0 Enzyme Non-Enzyme ? ? . . . Graphs ? ? ? Collection of Graphs (e.g. Protein Graphs) Assume we know the labels of a few graphs How to predict the labels of the unlabeled graphs?
- 55. Features Graphs Motif/Graphlet Feature Extraction Model Learning Predict Labels of Unlabeled Graphs Label 1 Label 0 ? ? ? ? . . . Graphs ? ? ? Protein Graphs
- 56. § D&D – 1178 protein graphs. Binary labeled as Enzymes vs. Non. Enzymes § MUTAG – 188 mutagenic compounds. Binary labeled (whether or not they have a mutagenic effect on the Gram- negative bacterium) § 10-fold validation, Support Vector Machine § Used 3,4 node motif counts as features
- 57. § Higher-order network analysis § Graph classification § Higher-order Network Embeddings
- 58. input 0 … 1 … 0 … 0 … 2 0 … 1 Feature Engineering features 1 … 1 … 0 0 1 0 0 Learning AlgorithmModel Prediction Task Link prediction Classification Anomaly detection
- 59. input 0 … 1 … 0 … 0 … 2 0 … 1 Feature Engineering features 1 … 1 … 0 0 1 0 0 Learning AlgorithmModel Prediction Task Automatic Feature Learning Link prediction Classification Anomaly detection
- 60. § Goal: Learn representation (features) for a set of graph elements (nodes, edges, etc.) § Key intuition: Map the graph elements (e.g., nodes) to the d-dimension space, while preserving structural similarity § Use the features for any downstream prediction task
- 61. Given a graph G=(V,E) and a set of T network motifs, We form the weighted motif adjacency matrices # of instances of motif Ht that contain nodes i and j
- 63. To generalize HONE for any motif-based matrix formulation, we can define a function Motif Transition Matrix Laplacian Matrix Normalized Laplacian Matrix Motif Matrix Formulations
- 64. The number of paths weighted by motif counts from node i to node j in k- steps is The probability of transitioning from node i to node j in k-steps is given by We derive all k-step motif-based matrices for all T motifs and K steps ,""""for"k=1,…,K""and""t=1,…,T K-step Motif Matrix Local Embeddings
- 65. Concatenate all k-step node embedding for all T motifs and K steps and find a ”global” higher-order node embeddings by solving H D"×"TKD& Z N"×"D Each row of Z is a D-dimensional embedding of a node Global Embeddings
- 66. Experimental setup § 10-fold cross-validation, repeated for 10 random trials § Used all 2-4 node connected orbits § D=128, Dl = 16 for the local motif embeddings § Edge embedding derived via mean function § Predict link existence via logistic regression (LR) § # steps K selected via grid search over Main findings: § Mean Gain in AUC of 19.24% (& up to 75.21%)
- 67. ✓ (1) Motif Counting Theory and Algorithms for Large Graphs ✓ (2) Machine Learning Applications for Motif Counting
- 68. Motif Counting Higher-order network analysis Graph Classification Higher-order Network Embeddings Higher-order Clustering Role Discovery
- 69. § Framework & Algorithms • One of the first parallel approaches for graphlet counting • On average 460x faster than current methods • Edge-centric computations (only requires access to edge neighborhood) • Time and space-efficient • Sampling/estimation methods • Local/global counting § Applications • Large-scale graph comparison, classification, and anomaly detection • Visual analytics and real-time graphlet mining • Higher-order network embeddings
- 71. § Efficient Graphlet Counting for Large Networks. ICDM 2015, [Ahmed et al.] § Graphlet Decomposition: Framework, Algorithms, and Applications. J. Know. & Info. 2016 [Ahmed et al.] § Estimation of Graphlet Counts in Massive Networks. IEEE TNNLS 2018 [Rossi-Zhou-Ahmed] § Higher-order Network Representation Learning. WWW 2018 [Rossi-Ahmed-Koh] § Network Motifs: Simple Building Blocks of Complex Networks. Science 2002, [Milo et al.] § Uncovering Biological Network Function via Graphlet Degree Signatures. Cancer Informatics 2008 [Milenković-Pržulj] § Graph Kernels. JMLR 2010, [Vishwanathan et al.] § Graph Sample and Hold: A Framework for Big Graph Analytics. KDD 2014 [Ahmed-Duffield-Neville-Kompella] § Role Discovery in Networks. IEEE TKDE 2015 [Rossi-Ahmed] § The Structure and Function of Complex Networks. SIAM Review 2003, [Newman] § Biological network comparison using graphlet degree distribution. Bioinformatics 2007 [Pržulj] § Efficient Graphlet Kernels for Large Graph Comparison. AISTAT 2009 [Shervashidze et al.] § Revealing Missing Parts of the Interactome via Link Prediction. PLoS ONE 2014 [Hulovatyy-Solava-Milenković] § Graft: An Efficient Graphlet Counting Method for Large Graph Analysis. TKDE 2014 [Rahman-Bhuiyan-Hasan] § Graph-Based Anomaly Detection, KDD 2003 [Noble-Cook] § Local structure in social networks. Sociological methodology 1976, [Holland-Leinhardt] § The strength of weak ties: A network theory revisited. Sociological theory 1983 [Granovetter] § Overlapping community detection in networks. ACM Computing Surveys 2013 [Xie-Kelley-Szymanski] § Automatic large scale generation of internet pop level maps. GLOBECOM 2008 [Feldman et al.] § Efficient semi-streaming algorithms for local triangle counting in massive graphs. KDD 2008 [Becchetti et al.]