Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

The Power of Motif Counting Theory, Algorithms, and Applications for Large Graphs

26 views

Published on

Keynote presentation at GraML 2018, The Second Workshop on the Intersection of Graph Algorithms and Machine Learning, Co-located with IPDPS 2018.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

The Power of Motif Counting Theory, Algorithms, and Applications for Large Graphs

  1. 1. Joint Work with: Jen Neville Ryan Rossi Nick Duffield
  2. 2. - - - - - Social network Human Disease Network [Barabasi 2007] Food Web [2007] Terrorist Network [Krebs 2002]Internet (AS) [2005] Gene Regulatory Network [Decourty 2008] Protein Interactions [breast cancer] Political blogs Power grid
  3. 3. (1) Motif Counting Theory and Algorithms for Large Graphs (2) Machine Learning Applications for Motif Counting
  4. 4. § Network motifs (graphlets) are small induced subgraphs § Motifs represent patterns of inter-connections, basic elements in complex networks CliqueTriangle CycleEdge
  5. 5. A study of motif frequencies in real-world complex networks Applied to food web, genetic, neural, web, and other network - Found distinct motifs in each case
  6. 6. Red dashed lines indicate edges that participate in a feed-forward loop Motifs are over-expressed in real complex networks
  7. 7. Motifs occur in real networks with frequencies significantly higher than randomly generated networks Real networks often have modular structure Cj# Ci# Ck# Concentration of Motifs in real vs random networks Two Conclusions
  8. 8. Nodes/edges are not the fundamental units of real networks Motifs are the building blocks of these networks We should analyze/model graphs in terms of motifs
  9. 9. Network Motifs: Simple Building Blocks of Complex Networks – [Milo et. al – Science 2002] The Structure and Function of Complex Networks – [Newman – Siam Review 2003] 2-node Graphlets 3-node Graphlets 4-node Graphlets Connected Disconnected Motifs/graphlets are Small k-vertex induced subgraphs
  10. 10. Ex: Given an input graph G - How many triangles in G? - How many cliques of size 4-nodes in G? - How many cycles of size 4-nodes in G?
  11. 11. Ex: Given an input graph G - How many triangles in G? - How many cliques of size 4-nodes in G? - How many cycles of size 4-nodes in G? à In practice, we would like to count all k-vertex graphlets
  12. 12. § Enumerate all possible graphlets
  13. 13. § Enumerate all possible graphlets à Exhaustive enumeration is too expensive
  14. 14. § Enumerate all possible graphlets à Exhaustive enumeration is too expensive § Count graphlets for each node – and combine all node counts [Shervashidze et. al – AISTAT 2009]
  15. 15. § Enumerate all possible graphlets à Exhaustive enumeration is too expensive § Count graphlets for each node – and combine all node counts à Still expensive for relatively large k [Shervashidze et. al – AISTAT 2009]
  16. 16. § Enumerate all possible graphlets à Exhaustive enumeration is too expensive § Count graphlets for each node – and combine all node counts à Still expensive for relatively large k [Shervashidze et. al – AISTAT 2009] § Other recent work counts only connected graphlets of size k=4 [Marcus & Shavitt – Computer Networks 2012] Not practical – scales only for small graphs with few hundred/thousand nodes/edges - taking hours for a graph with 30K nodes
  17. 17. Most work focused on graphlets of k=3 nodes In this work, we focus on graphlets of k=3,4 nodes Efficient Graphlet Counting for Large Networks [Ahmed et al., ICDM 2015] Graphlet Decomposition: Framework, Algorithms, and Applications [Ahmed et al., KAIS Journal 2016]
  18. 18. Edge-centric, Parallel, Fast, Space-efficient Framework u v v2 v3v1 v4 v6 v7 edge Edge Neighborhood Jointly count motifs
  19. 19. Searching Edge Neighborhoods ① For each edge do u v v2 v3v1 v4 v6 v7 edge
  20. 20. Searching Edge Neighborhoods ① For each edge do • Count All 3-node graphlets ② Merge counts from all edges u v v2 v3v1 v4 v6 v7 edge Triangle 2-star 1-edge Independent
  21. 21. Searching Edge Neighborhoods ① For each edge do • Count All 3-node graphlets ② Merge counts from all edges u v v2 v3v1 V4 v6 v7 edge Triangle 2-star 1-edge Independent q We only need to find/count triangles q Use equations to get counts of others in o(1) Triangle
  22. 22. How to count all 4-node graphlets? 4-Clique 4-Cycle4-Chrodal-Cycle Tailed-triangle 4-Path 3-Star 4-node-triangle 4-node-2star 4-node-2edge 4-node-1edge Independent
  23. 23. ± 1 edge 4-Node Motif Transition Diagram
  24. 24. Step 1 Step 2 Step 3 Searching Edge Neighborhoods For each edge Find the triangles Count 4-node graphlets For each edge Count 4-node cliques and 4-node cycles only Count 4-node graphlets For each edge Use combinatorial relationships to compute counts of other graphlets in constant time Step 4 Merge counts from all edges
  25. 25. ± 1 edge 4-Node Motif Transition Diagram
  26. 26. 4-Node Graphlet Transition Diagram ± 1 edge 4-Cliques 4-Cycles
  27. 27. ± 1 edge 4-Node Motif Transition Diagram Count a few patterns Use relationships & transitions to count all other graphlets in constant time ✓ Fast ✓ Space-Efficient
  28. 28. 4-Node Motif Transition Diagram ± 1 edge 4-Cliques 4-Cycles Maximum no. triangles Incident to an edge Maximum no. stars Incident to an edge Count a few patterns Use relationships & transitions to count all other graphlets in constant time
  29. 29. T T Relationship between 4-cliques & 4-ChordalCycles 4-Cliques 4-ChordalCycle e T T e No. 4-ChordalCycles No. 4-Cliques Proof in Lemma 1 - Ahmed et al., ICDM 2015
  30. 30. T T Relationship between 4-cliques & 4-ChordalCycles T T No. 4-ChordalCycles No. 4-Cliques 4-Cliques 4-ChordalCycle e e Proof in Lemma 1 - Ahmed et al., ICDM 2015
  31. 31. 0 1 2 4 8 16 0 5 10 15 Number of Processing Units Speedup socfb−Texas socfb−OR socfb−UCLA socfb−Berkeley13 socfb−MIT socfb−Penn94 0 1 2 4 8 16 0 5 10 15 Number of Processing Units Speedup 0 1 2 4 8 16 0 2 4 6 8 10 12 14 Number of Processing Units Speedup tech−internet−as tech−WHOIS web−it−2004 web−spam 0 1 2 4 8 16 0 2 4 6 8 10 12 14 Number of Processing Units Speedup Strong scaling results Intel Xeon 3.10 Ghz E5-2687W server v3, 16 cores
  32. 32. Comparison to RAGE [Marcus & Shavitt – J. Computer Networks 2011] Facebook100 Networks from US Schools Ours RAGE Time in Seconds
  33. 33. |V| |E| Ours RAGE Time in Seconds Baseline (RAGE) did not finish for most graphs We take ~4.5 secs for web-google (4.3M edges) We take ~4 secs for inf-road-usa (29M edges) Most motif counts in orders of 106 – 1015
  34. 34. § Node-level and edge-level motif counting Role discovery, Relational Learning, Multi-label Classification
  35. 35. 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds Key Observations The distribution of graphlet runtimes for edge neighborhoods obey a power-law.
  36. 36. 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds Most edge neighborhoods are fast with runtimes that are approximately equal. Key Observations The distribution of graphlet runtimes for edge neighborhoods obey a power-law.
  37. 37. 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds HOWEVER, a handful of neighborhoods are hard and take significantly longer. Most edge neighborhoods are fast with runtimes that are approximately equal. Key Observations The distribution of graphlet runtimes for edge neighborhoods obey a power-law. Neighborhood runtimes are power-lawed
  38. 38. 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds 0 2 4 6 8 10 x 10 4 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 Edges Timeinseconds HOWEVER, a handful of neighborhoods are hard and take significantly longer. Most edge neighborhoods are fast with runtimes that are approximately equal. Key Observations The distribution of graphlet runtimes for edge neighborhoods obey a power-law. Neighborhood runtimes are power-lawed QUESTION: How to reduce runtime? à Sampling
  39. 39. § [Horvitz-Thompson Estimation 1952] § Using HT estimator we scale the sampled count by probability p Count of graphlet i for edge j Estimated Count of graphlet i for edge j More details in the paper See Lemma 1
  40. 40. § [Hoeffding’s Inequality 1963] More details in the paper See Lemma 2
  41. 41. Sampled maximum degree in the graph Less than the actual maximum degree
  42. 42. Average relative error for top 1000 (highest degree) edges Sampling with replacement p = 0.01 GFD: Graphlet Frequency Distribution Kolmogorov distance (KS) L1 Normalized distance
  43. 43. 1 2 4 8 12 16 0 2 4 6 8 10 12 14 16 Number of processing units Speedup socfb−MIT bio−dmela soc−gowalla tech−RL−caida web−wikipedia09 1 2 4 8 12 16 0 2 4 6 8 10 12 14 16 Number of processing units Speedup Strong scaling results Using Intel Xeon E5-2687W v3 server, 16 cores
  44. 44. § Unbiased Estimation of Motif Counts 10 4 10 5 0.85 0.9 0.95 1 1.05 1.1 1.15 soc−orkut−dir Sample Size 10 4 10 5 0.85 0.9 0.95 1 1.05 1.1 1.15 soc−orkut−dir Sample Size x/y 10 4 10 5 0.9 0.95 1 1.05 1.1 1.15 soc−flickr Sample Size 10 4 10 5 0.9 0.95 1 1.05 1.1 1.15 soc−flickr Sample Size Estimation of counts of 4-vertex clique [Ahmed et. al, ICDM 2015, KAIS 2016, TNNLS 2018]
  45. 45. Counting a few motifs + Storing a few motif counts + Obtain other counts using combinatorial relationships (1) Fast and parallel (2) Space-efficient Two Conclusions Joint Motif Counting for Large Graphs
  46. 46. ✓ (1) Motif Counting Theory and Algorithms for Large Graphs (2) Machine Learning Applications for Motif Counting
  47. 47. § Biological Networks • network alignment, protein function prediction [Pržulj 2007][Milenković-Pržulj 2008] [Hulovatyy-Solava-Milenković 2014] [Shervashidze et al. 2009][Vishwanathan et al. 2010] § Social Networks • Triad analysis, role discovery, community detection [Granovetter 1983][Holland-Leinhardt 1976][Rossi-Ahmed 2015] [Ahmed et al. 2015][Xie-Kelley-Szymanski 2013] § Internet AS [Feldman et al. 2008] § Spam Detection [Becchetti et al. 2008][Ahmed et al. 2016] - - - - - Useful for various machine learning tasks e.g., Anomaly detection, Role Discovery, Relational Learning, Clustering etc.
  48. 48. § Higher-order network analysis § Graph classification § Higher-order Network Embeddings
  49. 49. Higher-order network structures § Visualization – “spotting anomalies” [Ahmed et al. ICDM 2015] § Finding large cliques, stars, and other larger network structures [Ahmed et al. KAIS 2015] § Spectral clustering [Benson et al. Science 2016] § Role discovery [Ahmed et al. 2016] ...
  50. 50. Ranking by graphlet counts Nodes are colored/weighted by triangle counts Links are colored/weighted by stars of size 4 nodes Leukemia Colon cancer Deafness
  51. 51. § Higher-order network analysis § Graph classification § Higher-order Network Embeddings
  52. 52. Label 1 Label 0 Enzyme Non-Enzyme Collection of Graphs (e.g. Protein Graphs) . . . Graphs Each Protein is represented by a graph Binary label represents the function of the protein
  53. 53. Label 1 Label 0 Enzyme Non-Enzyme ? ? . . . Graphs ? ? ? Collection of Graphs (e.g. Protein Graphs) Assume we know the labels of a few graphs How to predict the labels of the unlabeled graphs?
  54. 54. Features Graphs Motif/Graphlet Feature Extraction Model Learning Predict Labels of Unlabeled Graphs Label 1 Label 0 ? ? ? ? . . . Graphs ? ? ? Protein Graphs
  55. 55. § D&D – 1178 protein graphs. Binary labeled as Enzymes vs. Non. Enzymes § MUTAG – 188 mutagenic compounds. Binary labeled (whether or not they have a mutagenic effect on the Gram- negative bacterium) § 10-fold validation, Support Vector Machine § Used 3,4 node motif counts as features
  56. 56. § Higher-order network analysis § Graph classification § Higher-order Network Embeddings
  57. 57. input 0 … 1 … 0 … 0 … 2 0 … 1 Feature Engineering features 1 … 1 … 0 0 1 0 0 Learning AlgorithmModel Prediction Task Link prediction Classification Anomaly detection
  58. 58. input 0 … 1 … 0 … 0 … 2 0 … 1 Feature Engineering features 1 … 1 … 0 0 1 0 0 Learning AlgorithmModel Prediction Task Automatic Feature Learning Link prediction Classification Anomaly detection
  59. 59. § Goal: Learn representation (features) for a set of graph elements (nodes, edges, etc.) § Key intuition: Map the graph elements (e.g., nodes) to the d-dimension space, while preserving structural similarity § Use the features for any downstream prediction task
  60. 60. Given a graph G=(V,E) and a set of T network motifs, We form the weighted motif adjacency matrices # of instances of motif Ht that contain nodes i and j
  61. 61. Weighted-Motif Graph
  62. 62. To generalize HONE for any motif-based matrix formulation, we can define a function Motif Transition Matrix Laplacian Matrix Normalized Laplacian Matrix Motif Matrix Formulations
  63. 63. The number of paths weighted by motif counts from node i to node j in k- steps is The probability of transitioning from node i to node j in k-steps is given by We derive all k-step motif-based matrices for all T motifs and K steps ,""""for"k=1,…,K""and""t=1,…,T K-step Motif Matrix Local Embeddings
  64. 64. Concatenate all k-step node embedding for all T motifs and K steps and find a ”global” higher-order node embeddings by solving H D"×"TKD& Z N"×"D Each row of Z is a D-dimensional embedding of a node Global Embeddings
  65. 65. Experimental setup § 10-fold cross-validation, repeated for 10 random trials § Used all 2-4 node connected orbits § D=128, Dl = 16 for the local motif embeddings § Edge embedding derived via mean function § Predict link existence via logistic regression (LR) § # steps K selected via grid search over Main findings: § Mean Gain in AUC of 19.24% (& up to 75.21%)
  66. 66. ✓ (1) Motif Counting Theory and Algorithms for Large Graphs ✓ (2) Machine Learning Applications for Motif Counting
  67. 67. Motif Counting Higher-order network analysis Graph Classification Higher-order Network Embeddings Higher-order Clustering Role Discovery
  68. 68. § Framework & Algorithms • One of the first parallel approaches for graphlet counting • On average 460x faster than current methods • Edge-centric computations (only requires access to edge neighborhood) • Time and space-efficient • Sampling/estimation methods • Local/global counting § Applications • Large-scale graph comparison, classification, and anomaly detection • Visual analytics and real-time graphlet mining • Higher-order network embeddings
  69. 69. Code https://github.com/nkahmed/PGD Data http://networkrepository.com à Email me for questions
  70. 70. § Efficient Graphlet Counting for Large Networks. ICDM 2015, [Ahmed et al.] § Graphlet Decomposition: Framework, Algorithms, and Applications. J. Know. & Info. 2016 [Ahmed et al.] § Estimation of Graphlet Counts in Massive Networks. IEEE TNNLS 2018 [Rossi-Zhou-Ahmed] § Higher-order Network Representation Learning. WWW 2018 [Rossi-Ahmed-Koh] § Network Motifs: Simple Building Blocks of Complex Networks. Science 2002, [Milo et al.] § Uncovering Biological Network Function via Graphlet Degree Signatures. Cancer Informatics 2008 [Milenković-Pržulj] § Graph Kernels. JMLR 2010, [Vishwanathan et al.] § Graph Sample and Hold: A Framework for Big Graph Analytics. KDD 2014 [Ahmed-Duffield-Neville-Kompella] § Role Discovery in Networks. IEEE TKDE 2015 [Rossi-Ahmed] § The Structure and Function of Complex Networks. SIAM Review 2003, [Newman] § Biological network comparison using graphlet degree distribution. Bioinformatics 2007 [Pržulj] § Efficient Graphlet Kernels for Large Graph Comparison. AISTAT 2009 [Shervashidze et al.] § Revealing Missing Parts of the Interactome via Link Prediction. PLoS ONE 2014 [Hulovatyy-Solava-Milenković] § Graft: An Efficient Graphlet Counting Method for Large Graph Analysis. TKDE 2014 [Rahman-Bhuiyan-Hasan] § Graph-Based Anomaly Detection, KDD 2003 [Noble-Cook] § Local structure in social networks. Sociological methodology 1976, [Holland-Leinhardt] § The strength of weak ties: A network theory revisited. Sociological theory 1983 [Granovetter] § Overlapping community detection in networks. ACM Computing Surveys 2013 [Xie-Kelley-Szymanski] § Automatic large scale generation of internet pop level maps. GLOBECOM 2008 [Feldman et al.] § Efficient semi-streaming algorithms for local triangle counting in massive graphs. KDD 2008 [Becchetti et al.]
  71. 71. Thank you! Questions? nesreen.k.ahmed@intel.com http://nesreenahmed.com

×