Statistical inference of generative network models - Tiago P. Peixoto

COMPLEX NETWORKS: THEORY, METHODS, AND APPLICATIONS (2ND EDITION)
May 16-20, 2016

Published in: Science

  1. 1. Statistical inference of generative network models Tiago P. Peixoto Universität Bremen Germany ISI Foundation Turin, Italy Como, May 2016
  2. 2. LARGE-SCALE NETWORK STRUCTURE
  3. 3. PROBLEM: HOW TO DETECT AND CHARACTERIZE MODULAR STRUCTURE?
  4. 4. STRUCTURE VS. NOISE
  10. 10. DIFFERENT METHODS, DIFFERENT RESULTS... [Figure panels: Betweenness; Modularity matrix; Infomap; Walktrap; Label propagation; Modularity; Modularity (Blondel); Spin glass; Infomap (overlapping); Clique percolation]
  11. 11. (XKCD, Randall Munroe) Are we stuck on this?
  12. 12. A PRINCIPLED APPROACH: GENERATIVE MODELS Before we devise algorithms, we must formulate probabilistic models for the formation of network structure. The parameters of the model describe the network structure. The actual values of the parameters are unknown beforehand, but can be inferred from data, using robust principles from statistics. Ultimately, we seek to: (1) reduce the complexity of network data; (2) assess the statistical significance of the results, and thus differentiate structure from noise; (3) compare different models as alternative hypotheses; (4) generalize from data and make predictions.
  13. 13. NOTHING MORE THAN TRADITIONAL SCIENCE...
  14. 14. How do we model networks?
  15. 15. EXPONENTIAL RANDOM GRAPH MODEL (ERGM)
      G → Network; m(G) → Observable (e.g. clustering, community structure, etc.)
      We want a probabilistic model: $P(G)$
      Maximum entropy principle. Ensemble entropy: $S = -\sum_G P(G) \ln P(G)$
      Maximize $S$ conditioned on $\langle m \rangle = m^*$, with $\langle m \rangle = \frac{1}{N} \sum_G m(G) P(G)$
      Lagrange multipliers: $\Lambda = -\sum_G P(G) \ln P(G) - \theta \left( \sum_G m(G) P(G) - N m^* \right)$
      $\frac{\partial \Lambda}{\partial P(G)} = 0$, $\frac{\partial \Lambda}{\partial \theta} = 0$
      $P(G|\theta) = \frac{e^{-\theta m(G)}}{Z(\theta)}$, $Z(\theta) = \sum_G e^{-\theta m(G)}$
      In Physics: ERGM → Gibbs ensemble; m(G) → Hamiltonian; θ → Inverse temperature; Z(θ) → Partition function
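The maximum-entropy result on this slide can be checked by brute force on a very small node set. The sketch below is only an illustration: it assumes the observable m(G) is the number of edges (a choice made here, not one stated on the slide), enumerates every simple graph on 4 nodes, and compares the mean edge count with the value expected when each edge is independent with probability 1/(e^θ + 1). The function name `ergm_edge_count` is hypothetical.

```python
import itertools
import math

def ergm_edge_count(n_nodes, theta):
    """Enumerate all simple graphs on n_nodes and weight each one by
    exp(-theta * m(G)), taking m(G) = number of edges (illustrative choice)."""
    pairs = list(itertools.combinations(range(n_nodes), 2))
    # Each graph is a 0/1 tuple with one entry per node pair.
    graphs = list(itertools.product([0, 1], repeat=len(pairs)))
    weights = [math.exp(-theta * sum(g)) for g in graphs]
    Z = sum(weights)                    # partition function Z(theta)
    probs = [w / Z for w in weights]    # P(G | theta)
    return pairs, graphs, probs

theta = 0.7
pairs, graphs, probs = ergm_edge_count(4, theta)

# Ensemble average of the edge count.
mean_edges = sum(p * sum(g) for g, p in zip(graphs, probs))

# For this observable, every edge is independent with probability 1 / (e^theta + 1).
p_edge = 1.0 / (math.exp(theta) + 1.0)
print(mean_edges, len(pairs) * p_edge)
```

Full enumeration is only feasible for a handful of nodes, since the number of graphs grows as 2^(N(N-1)/2).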
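  17. 17. ERGM WITH COMMUNITY STRUCTURE
      N nodes divided into B groups. Node partition, $b_i \in [1, B]$
      Observables: Number of edges between groups r and s, $e_{rs} = \sum_{ij} A_{ij} \delta_{b_i,r} \delta_{b_j,s}$
      $P(G|\theta) = \frac{\exp\left(-\sum_{r \le s} \theta_{rs} e_{rs}\right)}{Z(\theta)} = \prod_{i<j} \frac{e^{-A_{ij} \theta_{b_i b_j}}}{e^{-\theta_{b_i b_j}} + 1}$
      $p_{rs} = \frac{1}{e^{\theta_{rs}} + 1}$
      $P(G|b, \theta) = \prod_{i<j} p_{b_i b_j}^{A_{ij}} (1 - p_{b_i b_j})^{1 - A_{ij}}$
      $p_{rs}$ → Probability of an edge existing between two nodes of groups r and s. $\langle e_{rs} \rangle = n_r n_s p_{rs}$
      a.k.a. the stochastic block model (SBM)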
  18. 18. THE STOCHASTIC BLOCK MODEL (SBM) P. W. HOLLAND ET AL., SOC NETWORKS 5, 109 (1983)
      Large-scale modules: N nodes divided into B groups.
      Parameters: $b_i$ → group membership of node i; $p_{rs}$ → probability of an edge between nodes of groups r and s.
      Properties: General model for large-scale structures. Traditional assortative communities are a special case, but it also admits arbitrary mixing patterns (e.g. bipartite, core-periphery, etc). Formulation for directed graphs is trivial. The meaning of “communities” or “groups” is well defined in the model.
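  19. 19. BASIC SBM VARIATIONS
      Bernoulli (simple graphs): $P(G|b, \theta) = \prod_{i<j} p_{b_i b_j}^{A_{ij}} (1 - p_{b_i b_j})^{1 - A_{ij}}$ (Holland et al, 1983)
      Poisson (multigraphs): $P(G|b, \lambda) = \prod_{i<j} \frac{e^{-\lambda_{b_i b_j}} \lambda_{b_i b_j}^{A_{ij}}}{A_{ij}!}$ (Karrer and Newman, 2011)
      Microcanonical: $P(G|b, \{e_{rs}\}) = \frac{1}{\Omega(\{e_{rs}\}, \{n_r\})}$ (Peixoto, 2012)
      Simple graphs: $\Omega(\{e_{rs}\}, \{n_r\}) = \prod_{r<s} \binom{n_r n_s}{e_{rs}} \prod_r \binom{\binom{n_r}{2}}{e_{rr}/2}$
      Multigraphs: $\Omega(\{e_{rs}\}, \{n_r\}) = \prod_{r<s} \left(\!\!\binom{n_r n_s}{e_{rs}}\!\!\right) \prod_r \left(\!\!\binom{\left(\!\binom{n_r}{2}\!\right)}{e_{rr}/2}\!\!\right)$
      Here $\left(\!\binom{n}{k}\!\right) = \binom{n + k - 1}{k}$ is the multiset coefficient.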
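To make the Bernoulli variant concrete, here is a minimal sketch of how a simple graph could be drawn from it: each node pair i < j receives an edge independently with probability p[b_i][b_j]. The function name `sample_sbm` and the 0-based group labels are assumptions of this example, not something taken from the slides.

```python
import random

def sample_sbm(b, p, rng=None):
    """Draw a simple undirected graph from the Bernoulli SBM.

    b   : list of group labels, b[i] in {0, ..., B-1} (0-based here)
    p   : B x B symmetric matrix of edge probabilities p[r][s]
    rng : optional random.Random instance, for reproducibility
    """
    rng = rng or random.Random()
    n = len(b)
    # One independent Bernoulli trial per node pair i < j.
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if rng.random() < p[b[i]][b[j]]]

# Assortative example: dense inside groups, sparse between them.
b = [0, 0, 0, 1, 1, 1]
p = [[0.9, 0.1],
     [0.1, 0.9]]
edges = sample_sbm(b, p, rng=random.Random(42))
print(edges)
```

Swapping the matrix for one with large off-diagonal entries gives a bipartite-like or core-periphery pattern with the same code, which is the sense in which the model admits arbitrary mixing patterns.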
  20. 20. PARAMETRIC INFERENCE VIA MAXIMUM LIKELIHOOD
      Data likelihood: $P(G|\theta)$
      G → Observed network; θ → Model parameters: $\{p_{rs}\}$, $\{b_i\}$
      Generation: $\theta \longrightarrow P(G|\theta) \longrightarrow G$
      Inference: $G \longrightarrow \hat\theta = \underset{\theta}{\operatorname{argmax}}\, P(G|\theta)$
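  26. 26. EFFICIENT MCMC INFERENCE ALGORITHM T. P. P., PHYS. REV. E 89, 012804 (2014)
      Idea: Use the currently-inferred structure to guess the best next move.
      Choose a random vertex v (happens to belong to block r).
      Move it to a random block $s \in [1, B]$, chosen with a probability $p(r \to s|t)$ proportional to $e_{ts} + \epsilon$, where t is the block membership of a randomly chosen neighbour of v.
      Accept the move with probability $a = \min\left\{ e^{-\beta \Delta S} \frac{\sum_t p^i_t \, p(s \to r|t)}{\sum_t p^i_t \, p(r \to s|t)}, 1 \right\}$, where $p^i_t$ is the fraction of neighbours of node i that belong to block t.
      Repeat.
      [Figure: node i with $b_i = r$ and a neighbour j with $b_j = t$; edge counts $e_{tr}$, $e_{ts}$, $e_{tu}$ between block t and blocks r, s, u]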
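The proposal step of this slide can be sketched as follows. This is a hedged illustration rather than the published implementation: it covers only the move proposal (not the acceptance step, which needs the entropy difference ΔS), it assumes the normalization p(r → s|t) = (e_ts + ε) / (e_t + εB) with e_t = Σ_s e_ts, and it assumes the chosen vertex has at least one neighbour. The helper names and the default `eps=1.0` are hypothetical.

```python
import random

def block_edge_counts(edges, b, B):
    """e[r][s] = sum_ij A_ij delta(b_i, r) delta(b_j, s).
    With this convention e[r][r] counts every internal edge twice."""
    e = [[0] * B for _ in range(B)]
    for i, j in edges:
        e[b[i]][b[j]] += 1
        e[b[j]][b[i]] += 1
    return e

def proposal_prob(s, t, e, B, eps=1.0):
    """p(r -> s | t): proportional to e_ts + eps, normalized over s."""
    return (e[t][s] + eps) / (sum(e[t]) + eps * B)

def propose_move(v, neighbours, b, e, B, eps=1.0, rng=random):
    """Propose a new block for vertex v (assumes v has a neighbour)."""
    t = b[rng.choice(neighbours[v])]             # block of a random neighbour
    weights = [e[t][s] + eps for s in range(B)]  # favour blocks t connects to
    return rng.choices(range(B), weights=weights)[0]

# Two triangles joined by a single edge, split into their natural blocks.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
b = [0, 0, 0, 1, 1, 1]
B = 2
neighbours = {v: [] for v in range(len(b))}
for i, j in edges:
    neighbours[i].append(j)
    neighbours[j].append(i)

e = block_edge_counts(edges, b, B)
s = propose_move(2, neighbours, b, e, B, rng=random.Random(0))
print(e, s)
```

Because proposals follow the current block-level edge counts, most proposed moves land in blocks the vertex's neighbourhood already connects to, while the ε term keeps every block reachable.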
  27. 27. EFFICIENT MCMC INFERENCE ALGORITHM One remaining problem: Metastable states... [Figure: $S_t$ versus MCMC sweeps] T.P.P., Phys. Rev. E 89, 012804 (2014)
  28. 28. EFFICIENT MCMC INFERENCE ALGORITHM Solution: Agglomerative clustering. By making $\beta \to \infty$ we get a very reliable greedy heuristic! Algorithmic complexity: $O(N \ln^2 N)$ (independent of $B$)
  31. 31. INFERENCE ≠ BED OF ROSES There are some caveats...
  32. 32. PROBLEM 1: BROAD DEGREE DISTRIBUTIONS BRIAN KARRER AND M. E. J. NEWMAN, PHYS. REV. E 83, 016107, 2011
      [Figure: Political Blogs (Adamic and Glance, 2005)]
      Degree-corrected SBM: $p_{ij} = \theta_i \theta_j \lambda_{b_i, b_j}$
      $\lambda_{rs}$ → edges between groups; $\theta_i$ → propensity of node i
      $P(G|b, \theta, \lambda) = \prod_{i<j} \frac{e^{-\theta_i \theta_j \lambda_{b_i b_j}} \left(\theta_i \theta_j \lambda_{b_i b_j}\right)^{A_{ij}}}{A_{ij}!}$
      $\hat\theta_i = \frac{k_i}{e_{b_i}}$, $\hat\lambda_{rs} = e_{rs}$
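The maximum-likelihood estimates quoted for the degree-corrected model can be computed directly from an edge list and a partition. The sketch below assumes that e_r denotes Σ_s e_rs (the total degree of block r) and that every block has at least one edge endpoint; the function name `degree_corrected_mle` is hypothetical.

```python
def degree_corrected_mle(edges, b, B):
    """Return (theta_hat, lambda_hat) for the degree-corrected SBM.

    theta_hat[i]     = k_i / e_{b_i}, with e_r = sum_s e_rs (assumed reading)
    lambda_hat[r][s] = e_rs
    """
    n = len(b)
    k = [0] * n                          # node degrees
    e = [[0] * B for _ in range(B)]      # block edge counts e_rs
    for i, j in edges:
        k[i] += 1
        k[j] += 1
        e[b[i]][b[j]] += 1
        e[b[j]][b[i]] += 1
    e_r = [sum(row) for row in e]        # total degree of each block
    theta = [k[i] / e_r[b[i]] for i in range(n)]
    return theta, e

# Two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
b = [0, 0, 0, 1, 1, 1]
theta, lam = degree_corrected_mle(edges, b, 2)
print(theta)
```

With this reading the propensities sum to one inside each block, so a node's θ is simply its share of the block's total degree.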
  36. 36. (BIG) PROBLEM 2: OVERFITTING What if we don’t know the number of groups, B?
      $S = -\log_2 P(G|\{b_i\}, \{e_{rs}\}, \{k_i\})$
      [Figure: S (bits) versus B, with the fitted partition shown in turn for B = 2, 3, 4, 5, 6, 10, 20, 40, 70 and B = N]
      ... in other words: Overfitting!
  48. 48. NONPARAMETRIC BAYESIAN INFERENCE
      Bayes’ rule: $P(\theta|G) = \frac{P(G|\theta) P(\theta)}{P(G)}$
      $P(G|\theta)$ → Data likelihood; $P(\theta)$ → Prior; $P(\theta|G)$ → Posterior; $P(G) = \sum_\theta P(G|\theta) P(\theta)$ → Model evidence
  49. 49. THE MINIMUM DESCRIPTION LENGTH PRINCIPLE (MDL)
      Bayesian formulation. Posterior: $P(\theta|G) = \frac{P(G|\theta) P(\theta)}{P(G)}$
      G → Observed network; θ → Model parameters: $\{e_{rs}\}$, $\{b_i\}$; $P(\theta)$ → Prior on the parameters
      Inference vs. compression: $-\ln P(\theta|G) = \underbrace{-\ln P(G|\theta)}_{S} \; \underbrace{-\ln P(\theta)}_{L} + \ln P(G)$
      S → Information required to describe the network, when the model is known.
      L → Information required to describe the model.
      Σ = S + L → Description length: total information necessary to describe the data.
      Rissanen, J. Automatica 14 (5): 465–471 (1978)
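  50. 50. MDL FOR THE SBM T. P. PEIXOTO, PHYS. REV. LETT. 110, 148701 (2013)
      Microcanonical model: $\theta = \{b_i\}, \{e_{rs}\}$
      Data description length: $S = -\ln P(G|\{b_i\}, \{e_{rs}\}) = \sum_{r<s} \ln \binom{n_r n_s}{e_{rs}} + \sum_r \ln \binom{\binom{n_r}{2}}{e_{rr}/2}$
      Partition description length: $P(\{b_i\}) = P(\{b_i\}|\{n_r\}) P(\{n_r\})$, $L_p = -\ln P(\{b_i\}) = \ln \left(\!\!\binom{B}{N}\!\!\right) + \ln N! - \sum_r \ln n_r!$
      Edge description length: $P(\{e_{rs}\}) = \frac{1}{\Omega(B, E)}$, $L_e = -\ln P(\{e_{rs}\}) = \ln \left(\!\!\binom{\left(\!\binom{B}{2}\!\right)}{E}\!\!\right)$
      Here $\left(\!\binom{n}{k}\!\right) = \binom{n + k - 1}{k}$ is the multiset coefficient.
      $\Sigma = S + L_p + L_e$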
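As a rough numerical companion to this slide, the sketch below evaluates the three terms S, L_p and L_e (in nats) for a simple graph and a given partition. It assumes the reading of the formulas in which S sums log-binomials over block pairs r < s plus a diagonal term, L_p = ln((B multichoose N)) + ln N! − Σ_r ln n_r!, and L_e = ln((B(B+1)/2 multichoose E)); the function names are hypothetical and the graph is assumed to be simple.

```python
import math

def log_binom(n, k):
    """ln C(n, k) via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_multiset(n, k):
    """ln of the multiset coefficient ((n k)) = C(n + k - 1, k)."""
    return log_binom(n + k - 1, k)

def description_length(edges, b, B):
    """Return (S, L_p, L_e) in nats for a simple graph and partition b."""
    N, E = len(b), len(edges)
    n = [b.count(r) for r in range(B)]           # group sizes n_r
    m = [[0] * B for _ in range(B)]              # m[r][s], r <= s: edges between r and s
    for i, j in edges:
        r, s = sorted((b[i], b[j]))
        m[r][s] += 1
    S = 0.0
    for r in range(B):
        # Diagonal term: m[r][r] equals e_rr / 2 in the slide's convention.
        S += log_binom(n[r] * (n[r] - 1) // 2, m[r][r])
        for s in range(r + 1, B):
            S += log_binom(n[r] * n[s], m[r][s])
    L_p = (log_multiset(B, N) + math.lgamma(N + 1)
           - sum(math.lgamma(x + 1) for x in n))
    L_e = log_multiset(B * (B + 1) // 2, E)
    return S, L_p, L_e

# Two triangles joined by one edge, described with B = 2 and with B = 1.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
S2, Lp2, Le2 = description_length(edges, [0, 0, 0, 1, 1, 1], 2)
S1, Lp1, Le1 = description_length(edges, [0] * 6, 1)
print(S2 + Lp2 + Le2, S1 + Lp1 + Le1)
```

On a graph this small the single-group description is shorter, even though the two-group partition fits the edges better: the model cost outweighs the gain in S, which is the trade-off the description length is meant to capture.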
  51. 51. MINIMUM DESCRIPTION LENGTH
      S → Information required to describe the network, when the model is known.
      L → Information required to describe the model.
      Σ = S + L → Description length: total information necessary to describe the data.
      B     S (bits)   L (bits)   Σ (bits)
      2     1805.3     122.6      1927.9
      3     1688.1     203.4      1891.5
      4     1640.8     270.7      1911.5
      5     1590.5     330.8      1921.3
      6     1554.2     386.7      1940.9
      10    1451.0     590.8      2041.8
      20    1300.7     1037.8     2338.6
      40    1092.8     1730.3     2823.1
      70    881.3      2427.3     3308.6
      N     0          3714.9     3714.9
      [Figure: Σ versus B, with the minimum marked at B∗; the selected fit is B = 3]
      Occam’s razor: The best model is the one which most compresses the data.
      T. P. Peixoto, Phys. Rev. Lett. 110, 148701 (2013)
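The selection rule on these slides amounts to taking the B with the smallest Σ = S + L. A minimal sketch using the S and L values quoted on the slides (the variable names are illustrative):

```python
# (B, S in bits, L in bits) as quoted on the slides; "N" is the B = N case.
rows = [
    ("2", 1805.3, 122.6),
    ("3", 1688.1, 203.4),
    ("4", 1640.8, 270.7),
    ("5", 1590.5, 330.8),
    ("6", 1554.2, 386.7),
    ("10", 1451.0, 590.8),
    ("20", 1300.7, 1037.8),
    ("40", 1092.8, 1730.3),
    ("70", 881.3, 2427.3),
    ("N", 0.0, 3714.9),
]

# S alone always prefers more groups; the sum S + L does not.
best_by_S = min(rows, key=lambda r: r[1])
best_by_sigma = min(rows, key=lambda r: r[1] + r[2])
print(best_by_S[0], best_by_sigma[0])
```

Minimizing S alone picks B = N, the overfitting case, while minimizing the total description length picks B = 3.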
  64. 64. WORKS VERY WELL... Generated (B = 10) [Figure: planted block structure (r, s); $\Sigma_b / E$ versus B for $\langle k \rangle$ = 5, 5.3, 6, 7, 8, 9, 10, 15] T. P. Peixoto, Phys. Rev. Lett. 110, 148701 (2013)
  65. 65. WORKS VERY WELL... [Figure: $\Sigma_{t/c} / E$ versus B, degree-corrected versus traditional model, for the American football and Political Books networks] T. P. Peixoto, Phys. Rev. Lett. 110, 148701 (2013)
  66. 66. EXAMPLE: THE INTERNET MOVIE DATABASE (IMDB) T. P. Peixoto, Phys. Rev. Lett. 110, 148701 (2013)
      Bipartite network of actors and films. Fairly large: N = 372,787, E = 1,812,657
      MDL selects: B = 332
      [Figure: inferred block structure with numbered blocks; panels of $e_{rs}$ versus s, and of $n_r$ and $k_r$ versus r]
      Bipartiteness is fully uncovered!
      Detects meaningful features: Temporal; Spatial (Country); Type/Genre
  77. 77. PROBLEM 3: RESOLUTION LIMIT !? Minimum detectable block size ∼ $\sqrt{N}$. Why? [Figure: merge candidates versus remaining network; detectable and undetectable regions of $e_c$ versus N/B for N = 10^3, 10^4, 10^5, 10^6] T. P. Peixoto, Phys. Rev. Lett. 110, 148701 (2013)
  78. 78. RESOLUTION LIMIT: LACK OF PRIOR INFORMATION Assumption that all block structures (block graphs) occur with the same probability. $\Sigma \sim \underbrace{S}_{\text{Data | Model}} + \underbrace{B^2 \ln E}_{\text{Edge counts}} + \underbrace{N \ln B}_{\text{Node partition}}$ Noninformative priors → resolution limit
  79. 79. SOLUTION: MODEL THE MODEL T. P. PEIXOTO, PHYS. REV. X 4, 011047 (2014)
      [Figure: nested model with hierarchy levels l = 0, 1, 2, 3. Observed network: N nodes, E edges. Nested model: $B_0$ nodes, E edges; $B_1$ nodes, E edges; $B_2$ nodes, E edges]
      [Figure: $e_c$ versus N/B for N = 10^3, 10^4, 10^5, 10^6, with regions: detectable; detectable with nested model; undetectable]
      $N / B_{\max} \sim \ln N \ll \sqrt{N}$
  86. 86. HIERARCHICAL MODEL: BUILT-IN VALIDATION [Figure: NMI (B = 16), NMI (nested) and inferred number of levels L versus c, with example networks (a) to (d)] T. P. Peixoto, Phys. Rev. X 4, 011047 (2014)
  87. 87. EMPIRICAL NETWORKS [Figures: N/B versus E for the nonhierarchical SBM and for the hierarchical SBM; points are labelled 0 to 32 as in the table] T.P.P., Phys. Rev. X 4, 011047 (2014)
      No.  Network                          N           E            Dir.
      0    Dolphins                         62          159          No
      1    Political Books                  105         441          No
      2    American Football                115         613          No
      3    C. Elegans Neurons               297         2,345        Yes
      4    Disease Genes                    903         6,760        No
      5    Political Blogs                  1,222       19,021       Yes
      6    arXiv Co-Authors (gr-qc)         4,158       13,422       No
      7    Power Grid                       4,941       6,594        No
      8    arXiv Co-Authors (hep-th)        8,638       24,806       No
      9    arXiv Co-Authors (hep-ph)        11,204      117,619      No
      10   arXiv Co-Authors (astro-ph)      17,903      196,972      No
      11   arXiv Co-Authors (cond-mat)      21,363      91,286       No
      12   arXiv Citations (hep-th)         27,400      352,504      Yes
      13   arXiv Citations (hep-ph)         34,401      421,441      Yes
      14   PGP                              39,796      301,498      Yes
      15   Internet AS (Caida)              52,104      399,625      Yes
      16   Brightkite social network        56,739      212,945      No
      17   Epinions.com trust network       75,877      508,836      Yes
      18   Slashdot                         82,168      870,161      Yes
      19   Flickr                           105,628     2,299,623    No
      20   Gowalla social network           196,591     950,327      No
      21   EU email                         224,832     394,400      Yes
      22   Web graph of stanford.edu        255,265     2,234,572    Yes
      23   DBLP collaboration               317,080     1,049,866    No
      24   WWW                              325,729     1,469,679    Yes
      25   Amazon product network           334,863     925,872      No
      26   IMDB film-actor (bipartite)      372,547     1,812,312    No
      27   APS citations                    449,087     4,690,321    Yes
      28   Berkeley/Stanford web graph      654,782     7,499,425    Yes
      29   Google web graph                 855,802     5,066,842    Yes
      30   Youtube social network           1,134,890   2,987,624    No
      31   Yahoo groups (bipartite)         1,637,868   15,205,016   No
      32   US patent citations              3,764,117   16,511,740   Yes
  88. 88. EMPIRICAL NETWORKS POLITICAL BLOGS (N = 1,222, E = 19,027) T. P. Peixoto, Phys. Rev. X 4, 011047 (2014)
  89. 89. EMPIRICAL NETWORKS POLITICAL BLOGS (N = 1,222, E = 19,027) T. P. Peixoto, Phys. Rev. X 4, 011047 (2014)
  90. 90. EMPIRICAL NETWORKS INTERNET (AUTONOMOUS SYSTEMS) (N = 52,104, E = 399,625, B = 191) T. P. Peixoto, Phys. Rev. X 4, 011047 (2014)
  91. 91. EMPIRICAL NETWORKS IMDB FILM-ACTOR NETWORK (N = 372,447, E = 1,812,312, B = 717) T. P. Peixoto, Phys. Rev. X 4, 011047 (2014)
  92. 92. EMPIRICAL NETWORKS Human connectome
  93. 93. PART II Limits on inference How far can we recover what we put in?
  94. 94. SEMI-PARAMETRIC BAYESIAN INFERENCE A. DECELLE, F. KRZAKALA, C. MOORE, L. ZDEBOROVA, PHYS. REV. E 84, 066106 (2012)
$p_{rs}, \gamma_r$ → Known parameters
$b_i$ → Unknown parameters
$$P(\{b_i\}|G,\{p_{rs}\},\{\gamma_r\}) = \frac{P(G|\{b_i\},\{p_{rs}\})\,P(\{b_i\}|\{\gamma_r\})}{P(G|\{p_{rs}\},\{\gamma_r\})}$$
$$P(G|\{b_i\},\{p_{rs}\}) = \prod_{i<j} p_{b_i b_j}^{A_{ij}} (1 - p_{b_i b_j})^{1-A_{ij}}$$
$$P(\{b_i\}|\{\gamma_r\}) = \prod_i \gamma_{b_i}$$
$$P(G|\{p_{rs}\},\{\gamma_r\}) = \sum_{\{b_i\}} P(G|\{b_i\},\{p_{rs}\})\,P(\{b_i\}|\{\gamma_r\})$$
  95. 95. SEMI-PARAMETRIC BAYESIAN INFERENCE A. DECELLE, F. KRZAKALA, C. MOORE, L. ZDEBOROVA, PHYS. REV. E 84, 066106 (2012) We have a Potts-like model...
$$P(\{b_i\}|G,\{p_{rs}\},\{\gamma_r\}) = \frac{e^{-H(\{b_i\})}}{Z}$$
$$H(\{b_i\}) = -\sum_{i<j}\left[A_{ij}\ln p_{b_i b_j} + (1 - A_{ij})\ln(1 - p_{b_i b_j})\right] - \sum_i \ln\gamma_{b_i}$$
$$Z = \sum_{\{b_i\}} e^{-H(\{b_i\})}$$
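To make the Potts-like formulation concrete, here is a minimal sketch that evaluates H({b_i}) and the partition function Z by brute-force enumeration on a toy graph. The graph (two triangles joined by one edge), the values of p_rs and γ_r, and all function names are illustrative assumptions, not taken from the slides; enumeration costs B^N and is only feasible for a handful of nodes.

```python
import itertools
import math

def hamiltonian(A, b, p, gamma):
    """H({b_i}) = -sum_{i<j} [A_ij ln p + (1 - A_ij) ln(1 - p)] - sum_i ln gamma_{b_i}."""
    n = len(A)
    h = 0.0
    for i in range(n):
        h -= math.log(gamma[b[i]])
        for j in range(i + 1, n):
            prob = p[b[i]][b[j]]
            h -= math.log(prob) if A[i][j] else math.log(1.0 - prob)
    return h

def exact_posterior(A, p, gamma):
    """Enumerate all B^N partitions; returns the posterior and Z = P(G | p, gamma)."""
    n, B = len(A), len(gamma)
    weights = {b: math.exp(-hamiltonian(A, b, p, gamma))
               for b in itertools.product(range(B), repeat=n)}
    Z = sum(weights.values())
    return {b: w / Z for b, w in weights.items()}, Z

# Hypothetical toy data: two triangles connected by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

p = [[0.9, 0.1], [0.1, 0.9]]   # assumed known edge probabilities p_rs
gamma = [0.5, 0.5]             # assumed known group-size prior gamma_r

posterior, Z = exact_posterior(A, p, gamma)
best = max(posterior, key=posterior.get)
print(best, posterior[best], Z)
```

The most probable partition separates the two triangles (up to relabelling of the groups), and Z is the marginal likelihood P(G|{p_rs},{γ_r}) that appears in the denominator of the posterior.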
  96. 96. SEMI-PARAMETRIC BAYESIAN INFERENCE (Physicists doing complex systems.)
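  97. 97. CAVITY METHOD: SOLUTION ON A TREE
Bethe free energy:
$$F = -\ln Z = -\sum_i \ln Z^i + \sum_{i<j} A_{ij}\ln Z^{ij} - E$$
$$Z^{ij} = N\sum_{r<s} p_{rs}\left(\psi_r^{i\to j}\psi_s^{j\to i} + \psi_s^{i\to j}\psi_r^{j\to i}\right) + N\sum_r p_{rr}\,\psi_r^{i\to j}\psi_r^{j\to i}$$
$$Z^i = \sum_r \gamma_r e^{-h_r}\prod_{j\in\partial i}\sum_s N p_{rs}\,\psi_s^{j\to i}$$
Belief-propagation (BP):
$$\psi_r^{i\to j} = \frac{1}{Z^{i\to j}}\,\gamma_r e^{-h_r}\prod_{k\in\partial i\setminus j}\sum_s N p_{rs}\,\psi_s^{k\to i}$$
Auxiliary fields:
$$h_r = \sum_i\sum_s p_{rs}\,\psi_s^i$$
Node marginals:
$$P(b_i = r|p,\gamma) = \psi_r^i = \frac{1}{Z^i}\,\gamma_r\prod_{j\in\partial i}\sum_s (N p_{rs})^{A_{ij}}(1 - p_{rs})^{1-A_{ij}}\,\psi_s^{j\to i}$$
Exact on trees, and asymptotically exact on SBM networks with $N \gg B$. Algorithmic complexity: $O(E \times B^2)$
Decelle et al, 2012; M. Mézard, A. Montanari, “Information, Physics, and Computation”, 2009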
  98. 98. PHASE TRANSITION IN SBM INFERENCE A. DECELLE, F. KRZAKALA, C. MOORE, L. ZDEBOROVA, PHYS. REV. E 84, 066106 (2012) Uniform SBM with B groups of size N/B, with an assortative structure, $N p_{rs} = c_{\text{in}}\delta_{rs} + c_{\text{out}}(1 - \delta_{rs})$.
[Figure: overlap versus ε = c_out/c_in for q=2, c=3 (BP with N=500k, MC with N=70k) and for q=4, c=16 (BP with N=100k, MC with N=70k, with the number of BP iterations for N=10k and N=100k); the detectable and undetectable regions are separated at ε_s. A third panel shows overlap and f_para − f_order versus c for q=5, c_in=0, N=100k, with undetectable, hard-to-detect and easy-to-detect regions delimited by c_d, c_c and c_s, for ordered and random initializations.]
Detectable phase: $|c_{\text{in}} - c_{\text{out}}| > B\sqrt{\langle k\rangle}$ (Independent of the algorithm!)
Rigorously proven: e.g. E. Mossel, J. Neeman, and A. Sly (2015); E. Mossel, J. Neeman, and A. Sly (2014); C. Bordenave, M. Lelarge, and L. Massoulié (2015)
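The detectability condition can be checked numerically. The sketch below assumes the threshold reads |c_in − c_out| > B√⟨k⟩ with average degree ⟨k⟩ = (c_in + (B−1)c_out)/B for equal-sized groups; the closed form for the critical ratio ε_s is obtained by solving that condition at equality, and the function names are hypothetical.

```python
import math

def average_degree(c_in, c_out, B):
    """<k> for B equal-sized groups with N p_rs = c_in (r = s) or c_out (r != s)."""
    return (c_in + (B - 1) * c_out) / B

def is_detectable(c_in, c_out, B):
    """True when |c_in - c_out| exceeds B * sqrt(<k>)."""
    return abs(c_in - c_out) > B * math.sqrt(average_degree(c_in, c_out, B))

def critical_epsilon(c, B):
    """eps_s = c_out / c_in at the threshold, assortative case (c_in > c_out).

    Solving c_in - c_out = B sqrt(c) together with c = (c_in + (B-1) c_out) / B
    gives c_out = c - sqrt(c) and c_in = c + (B-1) sqrt(c).
    """
    s = math.sqrt(c)
    return (s - 1.0) / (s - 1.0 + B)

def rates_from_epsilon(c, B, eps):
    """Recover (c_in, c_out) from the average degree c and the ratio eps."""
    c_in = B * c / (1.0 + (B - 1) * eps)
    return c_in, eps * c_in

for B, c in [(2, 3), (4, 16)]:
    print(B, c, critical_epsilon(c, B))
```

For the two parameter sets shown in the figure (q=2, c=3 and q=4, c=16), this gives the value of ε_s below which the condition holds. For larger numbers of groups a hard-but-detectable regime also exists, which this simple check does not capture.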
  99. 99. EXPECTATION MAXIMIZATION + BP Maximum likelihood for $P(G|\{p_{rs}\},\{\gamma_r\}) = Z$
Expectation-maximization algorithm, starting from an initial guess for $\{p_{rs}\}, \{\gamma_r\}$:
1. Solve the BP equations and obtain the node marginals. (Expectation step)
2. Obtain new parameter estimates via $\{\hat{p}_{rs}\},\{\hat{\gamma}_r\} = \operatorname{argmax}_{\{p_{rs}\},\{\gamma_r\}} P(G|\{p_{rs}\},\{\gamma_r\})$ (Maximization step)
3. Repeat.
Algorithmic complexity: $O(EB^2)$
Problems: Converges to a local optimum. It is a parametric method, hence it cannot find the number of groups B.
  100. 100. EXPECTATION MAXIMIZATION + BP [Figures: −f_BP versus number of groups for q*=4, c=16; −free energy versus number of groups for the Zachary network and for a synthetic network.] Overfitting...
  101. 101. PART III Model variations and model selection
  102. 102. HYPOTHESIS TESTING T.P.P, PHYS. REV. X 5, 011033 (2015) Bayesian modelling allows us to select between different classes of models, according to statistical evidence.
Posterior odds ratio:
$$\Lambda = \frac{P(\{\theta\}_a|G,H_a)\,P(H_a)}{P(\{\theta\}_b|G,H_b)\,P(H_b)} = \exp(-\Delta\Sigma)$$
Subjective significance levels:
Λ | Evidence strength for H_b
Λ > 1 | Favors H_a
1/3 < Λ < 1 | Very weak
1/10 < Λ < 1/3 | Substantial
1/30 < Λ < 1/10 | Strong
1/100 < Λ < 1/30 | Very strong
Λ < 1/100 | Decisive
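The significance scale is easy to apply in code. This sketch assumes ΔΣ = Σ_a − Σ_b is given in nats and works with log10 Λ, since Λ itself underflows for the large description-length differences seen in empirical networks; the function names are illustrative.

```python
import math

def log10_odds(delta_sigma):
    """log10 of Lambda = exp(-delta_sigma), with delta_sigma = Sigma_a - Sigma_b in nats."""
    return -delta_sigma / math.log(10)

def evidence_strength(log10_lambda):
    """Map log10(Lambda) onto the subjective scale in the table above."""
    scale = [
        (0.0, "favors H_a"),
        (math.log10(1 / 3), "very weak"),
        (math.log10(1 / 10), "substantial"),
        (math.log10(1 / 30), "strong"),
        (math.log10(1 / 100), "very strong"),
    ]
    for lower_bound, label in scale:
        if log10_lambda > lower_bound:
            return label
    return "decisive"

# Values of Lambda quoted elsewhere in the deck.
for lam in (0.36, 0.053, 2e-4):
    print(lam, evidence_strength(math.log10(lam)))
print(evidence_strength(-747))
```

Boundary values (for example Λ exactly 1, which the slides use for the reference model itself) fall into the next-lower bracket here; that tie-breaking is a choice of this sketch.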
  103. 103. OVERLAPPING GROUPS (Palla et al 2005)
  104. 104. OVERLAPPING GROUPS (Palla et al 2005)
  105. 105. OVERLAPPING GROUPS (Palla et al 2005) Number of nonoverlapping partitions: $B^N$ Number of overlapping partitions: $2^{BN}$
  106. 106. OVERLAPPING GROUPS (Palla et al 2005) Number of nonoverlapping partitions: $B^N$ Number of overlapping partitions: $2^{BN}$
  107. 107. HYPOTHESIS TESTING T.P.P, PHYS. REV. X 5, 011033 (2015) [Figures: Λ versus B for the model variants DC, D > 1; DC, D = 1; NDC, D > 1; NDC, D = 1.]
Les Misérables: B = 5, non-degree-corrected, overlapping, Λ = 1; B = 4, non-degree-corrected, overlapping, Λ ≈ 2 × 10^−4
American College Football: B = 10, non-degree-corrected, nonoverlapping, Λ = 1; B = 9, non-degree-corrected, nonoverlapping, Λ ≈ 0.36
  108. 108. HYPOTHESIS TESTING / MODEL COMPARISON Political blogs B = 7, overlapping, degree-corrected, Λ = 1 B = 12, nonoverlapping, degree-corrected, log10 Λ ≈ −747
  109. 109. HYPOTHESIS TESTING / MODEL COMPARISON Social “ego” network (from Facebook) [Figure: histograms of n_k versus k.] B = 4, Λ ≈ 0.053 T.P.P, Phys. Rev. X 5, 011033 (2015)
  110. 110. HYPOTHESIS TESTING / MODEL COMPARISON Social “ego” network (from Facebook) [Figure: histograms of n_k versus k for both fits.] B = 4, Λ ≈ 0.053 B = 5, Λ = 1 T.P.P, Phys. Rev. X 5, 011033 (2015)
  111. 111. OVERLAP VS. NONOVERLAP [Figure: µ versus c, with marked points (a)-(f).] (a) B = 2, c = 0.99, µ = 0.025 (b) B = 3, c = 0.99, µ = 0.05 (c) B = 3, c = 0.98, µ = 0.06 (d) B = 7, c = 0.98, µ = 0.08 (e) B = 4, c = 0.97, µ = 0.12 (f) B = 15, c = 0.97, µ = 0.15
  112. 112. HYPOTHESIS TESTING / MODEL COMPARISON
No. | N | k | log10 Λ_DCO | log10 Λ_DC | log10 Λ_NDCO | log10 Λ_NDC | B | d | Σ/E
1 | 34 | 4.6 | −2.1 | −2.1 | — | 0 | 2 | 1 | 4
2 | 62 | 5.1 | −4.6 | −1.4 | — | 0 | 2 | 1 | 4.8
3 | 77 | 6.6 | −17 | −7.7 | 0 | −7.3 | 5 | 1.1 | 4
4 | 105 | 8.4 | −12 | −2.8 | −6.6 | 0 | 5 | 1 | 4.4
5 | 115 | 10.7 | −79 | −27 | — | 0 | 10 | 1 | 4.3
6 | 297 | 15.9 | 0 | −61 | −2.0 × 10^2 | −2.1 × 10^2 | 5 | 1.8 | 5.1
7 | 379 | 4.8 | −47 | −6.6 | 0 | −8.9 | 20 | 1.1 | 6.2
8 | 903 | 15.0 | −3.8 × 10^2 | −3.7 × 10^2 | 0 | −3.7 × 10^2 | 60 | 1.2 | 3.1
9 | 1,278 | 2.8 | −8.1 | 0 | −1.5 × 10^2 | −89 | 2 | 1 | 7.4
10 | 1,490 | 25.6 | 0 | −5.2 × 10^2 | −2.3 × 10^3 | −2.3 × 10^3 | 7 | 1.8 | 4.4
11 | 1,536 | 3.8 | −2.5 × 10^2 | 0 | −65 | −62 | 38 | 1 | 6.7
12 | 1,622 | 11.2 | −4.3 × 10^2 | 0 | −12 | −82 | 48 | 1 | 3.3
13 | 1,756 | 4.5 | −43 | 0 | −4.0 × 10^2 | −2.8 × 10^2 | 7 | 1 | 5.9
14 | 2,018 | 2.9 | −9.2 | 0 | −2.9 × 10^2 | −2.1 × 10^2 | 2 | 1 | 8.5
15 | 4,039 | 43.7 | −1.5 × 10^3 | 0 | −8.1 × 10^2 | −9.5 × 10^2 | 158 | 1 | 3.2
16 | 4,941 | 2.7 | −2.2 × 10^2 | 0 | −21 | −25 | 25 | 1 | 11
17 | 7,663 | 17.8 | 0 | −1.1 × 10^4 | −5.3 × 10^3 | −1.6 × 10^4 | 85 | 1 | 3.2
18 | 7,663 | 5.3 | −1.8 × 10^3 | 0 | −9.3 × 10^2 | −7.3 × 10^2 | 63 | 1 | 5
19 | 8,298 | 25.0 | −9.1 × 10^3 | 0 | −1.4 × 10^4 | −1.4 × 10^4 | 34 | 1 | 5.4
20 | 9,617 | 7.7 | −4.2 × 10^3 | 0 | −2.3 × 10^3 | −2.5 × 10^3 | 34 | 1 | 9.3
21 | 26,197 | 2.2 | −2.4 × 10^3 | −1.2 × 10^3 | 0 | −2.7 × 10^3 | 363 | 1.3 | 4.5
22 | 36,692 | 20.0 | −4.1 × 10^4 | 0 | −8.5 × 10^4 | −2.8 × 10^4 | 1812 | 1 | 5.5
23 | 39,796 | 15.2 | −6.1 × 10^4 | 0 | −8.8 × 10^4 | −4.5 × 10^4 | 1323 | 1 | 6.3
24 | 52,104 | 15.3 | −1.5 × 10^5 | 0 | −3.7 × 10^4 | −4.0 × 10^4 | 172 | 1 | 6.4
25 | 58,228 | 14.7 | 0 | −5.8 × 10^4 | −1.8 × 10^5 | −1.4 × 10^5 | 1995 | 3.2 | 7.3
26 | 65,888 | 305.2 | −4.4 × 10^4 | 0 | −4.6 × 10^5 | −4.6 × 10^5 | 384 | 1 | 4.1
27 | 68,746 | 1.5 | −4.8 × 10^3 | −1.4 × 10^3 | 0 | −7.0 × 10^3 | 719 | 1.4 | 6.4
28 | 75,888 | 13.4 | −1.1 × 10^5 | 0 | −8.2 × 10^4 | −9.0 × 10^4 | 143 | 1 | 8.9
29 | 89,209 | 5.3 | −1.0 × 10^4 | 0 | −9.7 × 10^3 | −1.1 × 10^4 | 848 | 1 | 3.2
30 | 108,300 | 3.5 | −3.3 × 10^3 | −5.2 × 10^3 | 0 | −2.4 × 10^4 | 1660 | 1.8 | 5.7
31 | 133,280 | 5.9 | 0 | −4.4 × 10^3 | −7.4 × 10^4 | −3.8 × 10^4 | 1944 | 5.3 | 4.4
32 | 196,591 | 19.3 | 0 | −1.9 × 10^5 | −7.1 × 10^5 | −6.6 × 10^5 | 6856 | 3.7 | 7.8
33 | 265,214 | 3.2 | −1.4 × 10^4 | 0 | −9.2 × 10^4 | −8.5 × 10^4 | 549 | 1 | 8.6
34 | 273,957 | 16.8 | −5.4 × 10^5 | 0 | −4.6 × 10^4 | −7.2 × 10^4 | 727 | 1 | 5.8
35 | 281,904 | 16.4 | −1.2 × 10^6 | 0 | −2.8 × 10^5 | −1.5 × 10^5 | 6655 | 1 | 4.3
36 | 317,080 | 6.6 | −1.7 × 10^5 | 0 | −3.9 × 10^5 | −4.2 × 10^5 | 8766 | 1 | 11
37 | 325,729 | 9.2 | −5.8 × 10^5 | 0 | −1.1 × 10^6 | −2.3 × 10^5 | 4293 | 1 | 5.8
38 | 325,729 | 9.2 | −5.6 × 10^5 | 0 | −1.2 × 10^6 | −2.5 × 10^5 | 3995 | 1 | 5.8
39 | 334,863 | 5.5 | −3.3 × 10^5 | 0 | −3.6 × 10^5 | −3.4 × 10^4 | 9118 | 1 | 11
40 | 372,787 | 9.7 | −1.0 × 10^6 | 0 | −1.3 × 10^5 | −1.4 × 10^5 | 965 | 1 | 11
41 | 463,347 | 20.3 | −6.4 × 10^5 | 0 | −1.8 × 10^6 | −1.5 × 10^6 | 9276 | 1 | 9.3
42 | 1,134,890 | 5.3 | — | 0 | −4.5 × 10^5 | −4.9 × 10^5 | 264 | 1 | 13
T.P.P, Phys. Rev. X 5, 011033 (2015)
  113. 113. SBM WITH LAYERS T.P.P, PHYS. REV. E 92, 042807 (2015) [Figure: collapsed graph and layers l = 1, l = 2, l = 3.] Fairly straightforward. Easily combined with degree-correction, overlaps, etc. Edge probabilities are in general different in each layer. Node memberships can move or stay the same across layers. Works as a general model for discrete as well as discretized edge covariates. Works as a model for temporal networks.
  114. 114. SBM WITH LAYERS T.P.P, PHYS. REV. E 92, 042807 (2015)
Edge covariates:
$$P(\{G_l\}|\{\theta\}) = P(G_c|\{\theta\})\prod_{r\le s}\frac{\prod_l m^l_{rs}!}{m_{rs}!}$$
Independent layers:
$$P(\{G_l\}|\{\{\theta\}_l\},\{\phi\},\{z_{il}\}) = \prod_l P(G_l|\{\theta\}_l,\{\phi\})$$
Embedded models can be of any type: Traditional, degree-corrected, overlapping.
  115. 115. LAYER INFORMATION CAN REVEAL HIDDEN STRUCTURE
  116. 116. LAYER INFORMATION CAN REVEAL HIDDEN STRUCTURE
  117. 117. ... BUT IT CAN ALSO HIDE STRUCTURE! [Figure: a network split into C layers; NMI versus c for the collapsed network and for E/C = 500, 100, 40, 20, 15, 12, 10, 5.]
  118. 118. MODEL SELECTION Null model: Collapsed (aggregated) SBM + fully random layers
$$P(\{G_l\}|\{\theta\},\{E_l\}) = P(G_c|\{\theta\}) \times \frac{\prod_l E_l!}{E!}$$
(we can also aggregate layers into larger layers)
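The layer term of this null model is the inverse of a multinomial coefficient, so it is cheap to evaluate in log space. A minimal sketch, assuming the layer sizes E_l are given as a list of edge counts and that ln P(G_c|{θ}) comes from a separate collapsed SBM fit (the function names and the placeholder value are hypothetical):

```python
import math

def log_random_layer_term(layer_edge_counts):
    """ln( prod_l E_l! / E! ), where E = sum_l E_l."""
    total = sum(layer_edge_counts)
    return (sum(math.lgamma(e + 1) for e in layer_edge_counts)
            - math.lgamma(total + 1))

def log_null_model(log_p_collapsed, layer_edge_counts):
    """ln P({G_l} | theta, {E_l}) = ln P(G_c | theta) + layer term."""
    return log_p_collapsed + log_random_layer_term(layer_edge_counts)

# Hypothetical example: 3 layers with 120, 80 and 40 edges.
print(log_random_layer_term([120, 80, 40]))
print(log_null_model(-1500.0, [120, 80, 40]))  # -1500.0 is a placeholder value
```

Using `math.lgamma` avoids overflowing factorials for networks with many edges. A layered SBM is then preferred over this null model only if its log-probability is higher.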
  119. 119. MODEL SELECTION EXAMPLE: SOCIAL NETWORK OF PHYSICIANS N = 241 Physicians Survey questions: “When you need information or advice about questions of therapy where do you usually turn?” “And who are the three or four physicians with whom you most often find yourself discussing cases or therapy in the course of an ordinary week – last week for instance?” “Would you tell me the first names of your three friends whom you see most often socially?” T.P.P, Phys. Rev. E 92, 042807 (2015)
  120. 120. MODEL SELECTION EXAMPLE: SOCIAL NETWORK OF PHYSICIANS
  121. 121. MODEL SELECTION EXAMPLE: SOCIAL NETWORK OF PHYSICIANS
  122. 122. MODEL SELECTION EXAMPLE: SOCIAL NETWORK OF PHYSICIANS Λ = 1 log10 Λ ≈ −50
  123. 123. EXAMPLE: BRAZILIAN CHAMBER OF DEPUTIES Voting network between members of congress (1999-2006) [Figure: inferred groups labelled by party (PMDB, PP, PTB, DEM, PT, PDT, PSB, PCdoB, PSDB, PR, PFL), arranged from right to center to left.]
  124. 124. EXAMPLE: BRAZILIAN CHAMBER OF DEPUTIES Voting network between members of congress (1999-2006) [Figure: inferred groups labelled by party, arranged from right to center to left; layered fits for 1999-2002 and 2003-2006, with groups (PR, PFL, PMDB, PT, PP, PTB, PDT, PSB, PCdoB, PPS, DEM, PSDB) divided into government and opposition.]
  125. 125. EXAMPLE: BRAZILIAN CHAMBER OF DEPUTIES Voting network between members of congress (1999-2006) [Figure: as on the previous slide, collapsed fit next to the layered fits for 1999-2002 and 2003-2006.] log10 Λ ≈ −111 Λ = 1
  126. 126. REAL-VALUED EDGES? Idea: Layers { } → bins of edge values! P({Gx}|{θ}{ }, { }) = P({Gl}|{θ}{ }, { }) × ∏ l ρ(xl) Bayesian posterior → Number (and shape) of bins
  127. 127. EXAMPLE: GLOBAL AIRPORT NETWORK [Figure: probability density of edge distance (km), with regions (a)-(e) and corresponding network panels (a)-(e).]
  128. 128. PROXIMITY BETWEEN HIGH-SCHOOL STUDENTS [Figure: probability density of edge time (hours), with regions (a)-(j) and corresponding network panels (a)-(j).] J. Fournet, A. Barrat. PLoS ONE 9(9):e107878 (2014), http://www.sociopatterns.org/
  129. 129. MOVEMENT BETWEEN GROUPS... [Figure: groups labelled by party (PT, PMDB, PDT, PSB, PP, PR, PTB, PCdoB, PSDB, PFL, DEM, PPS).]
  130. 130. TEMPORAL NETWORKS REDUX AS AN ACTUAL DYNAMICAL SYSTEM THIS TIME A more basic scenario: Sequences. $x_t \in [1, N]$ → Token at time t $\{x_t\} = \{x_0, x_1, x_2, x_3, \dots\}$ (Tokens could be letters, base pairs, itinerary stops, etc.)
  131. 131. n-ORDER MARKOV CHAINS WITH COMMUNITIES T. P. P. AND MARTIN ROSVALL, ARXIV: 1509.04740 Transitions conditioned on the last n tokens
$p(x_t|\vec{x}_{t-1})$ → Probability of transition from memory $\vec{x}_{t-1} = \{x_{t-n}, \dots, x_{t-1}\}$ to token $x_t$
Instead of such a direct parametrization, we divide the tokens and memories into groups:
$$p(x|\vec{x}) = \theta_x \lambda_{b_x b_{\vec{x}}}$$
$\theta_x$ → Overall frequency of token x
$\lambda_{rs}$ → Transition probability from memory group s to token group r
$b_x, b_{\vec{x}}$ → Group memberships of tokens and memories
[Figure: bipartite token-memory graphs (a), (b), (c) for $\{x_t\}$ = "It was the best of times"]
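The bipartite token-memory structure behind this model can be built directly from a sequence. The sketch below counts (memory, token) transitions for a chosen order n on the example sentence; treating each letter as a token, dropping spaces and keeping case are assumptions made here for illustration, and `transition_counts` is a hypothetical helper name.

```python
from collections import Counter

def transition_counts(sequence, n):
    """Count (memory, token) pairs, where the memory is the previous n tokens."""
    counts = Counter()
    for t in range(n, len(sequence)):
        memory = tuple(sequence[t - n:t])
        counts[(memory, sequence[t])] += 1
    return counts

text = "It was the best of times"
tokens = [ch for ch in text if ch != " "]  # assumption: letters are tokens

first_order = transition_counts(tokens, 1)
second_order = transition_counts(tokens, 2)

# Each distinct memory and each distinct token is a node of the bipartite graph;
# each counted transition is a (multi-)edge from a memory node to a token node.
memories = {m for m, _ in second_order}
print(len(set(tokens)), len(memories), sum(second_order.values()))
```

Grouping the token nodes and the memory nodes of this graph, with θ_x and λ_rs as parameters, is what turns the Markov chain into an SBM-like inference problem.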
  132. 132. n-ORDER MARKOV CHAINS WITH COMMUNITIES $\{x_t\}$ = “It was the best of times”
$$P(\{x_t\}|b) = \int d\lambda\, d\theta\; P(\{x_t\}|b,\lambda,\theta)\,P(\theta)\,P(\lambda)$$
The Markov chain likelihood is (almost) identical to the SBM likelihood that generates the bipartite graph. Nonparametric → We can select the number of groups and the Markov order based on statistical evidence! T. P. P. and Martin Rosvall, arXiv: 1509.04740
  133. 133. n-ORDER MARKOV CHAINS WITH COMMUNITIES EXAMPLE: FLIGHT ITINERARIES T. P. P. and Martin Rosvall, arXiv: 1509.04740
  134. 134. n-ORDER MARKOV CHAINS WITH COMMUNITIES
Dataset | n | B_N | B_M | Σ | Σ (second column)
US Air Flights | 1 | 384 | 365 | 364,385,780 | 365,211,460
US Air Flights | 2 | 386 | 7605 | 319,851,871 | 326,511,545
US Air Flights | 3 | 183 | 2455 | 318,380,106 | 339,898,057
US Air Flights | 4 | 292 | 1558 | 318,842,968 | 337,988,629
US Air Flights | 5 | 297 | 1573 | 335,874,766 | 338,442,011
War and peace | 1 | 65 | 71 | 11,422,564 | 11,438,753
War and peace | 2 | 62 | 435 | 9,175,833 | 9,370,379
War and peace | 3 | 70 | 1366 | 7,609,366 | 8,493,211
War and peace | 4 | 72 | 1150 | 7,574,332 | 9,282,611
War and peace | 5 | 71 | 882 | 10,181,047 | 10,992,795
Taxi movements | 1 | 387 | 385 | 2,635,789 | 2,975,299
Taxi movements | 2 | 397 | 1127 | 2,554,662 | 3,258,586
Taxi movements | 3 | 393 | 1036 | 2,590,811 | 3,258,586
Taxi movements | 4 | 397 | 1071 | 2,628,813 | 3,258,586
Taxi movements | 5 | 395 | 1095 | 2,664,990 | 3,258,586
“Rock you” password list | 1 | 140 | 147 | 1,060,272,230 | 1,060,385,582
“Rock you” password list | 2 | 109 | 1597 | 984,697,401 | 987,185,890
“Rock you” password list | 3 | 114 | 4703 | 910,330,062 | 930,926,370
“Rock you” password list | 4 | 114 | 5856 | 889,006,060 | 940,991,463
“Rock you” password list | 5 | 99 | 6430 | 1,000,410,410 | 1,005,057,233
Compressor | US Air Flights | War and peace | Taxi movements | “Rock you” password list
gzip | 573,452,240 | 9,594,000 | 4,289,888 | 1,315,388,208
LZMA | 402,125,144 | 7,420,464 | 2,902,904 | 1,097,012,288
(SBM can compress your files!) T. P. P. and Martin Rosvall, arXiv: 1509.04740
  135. 135. DYNAMIC NETWORKS Each token is an edge: $x_t \to (i, j)_t$ Dynamic network → Sequence of edges: $\{x_t\} = \{(i, j)_t\}$ Problem: Too many possible tokens! $O(N^2)$ Solution: Group the nodes into B groups. Pair of node groups (r, s) → edge group. Number of tokens: $O(B^2) \ll O(N^2)$ Two-step generative process: $\{x_t\} = \{(r, s)_t\}$ (n-order Markov chain), then $P((i, j)_t|(r, s)_t)$ (static SBM) T. P. P. and Martin Rosvall, arXiv: 1509.04740
  136. 136. DYNAMIC NETWORKS EXAMPLE: STUDENT PROXIMITY Static part [Figure: node groups labelled by class: PC, PCst, MPst1, MP, MPst2, 2BIO1, 2BIO2, 2BIO3, PSIst.] T. P. P. and Martin Rosvall, arXiv: 1509.04740
  137. 137. DYNAMIC NETWORKS EXAMPLE: STUDENT PROXIMITY Static part [Figure: node groups labelled by class: PC, PCst, MPst1, MP, MPst2, 2BIO1, 2BIO2, 2BIO3, PSIst.] Temporal part T. P. P. and Martin Rosvall, arXiv: 1509.04740
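  138. 138. DYNAMIC NETWORKS CONTINUOUS TIME $x_\tau$ → token at continuous time τ
$$P(\{x_\tau\}) = \underbrace{P(\{x_t\})}_{\text{Discrete chain}} \times \underbrace{P(\{\Delta_t\}|\{x_t\})}_{\text{Waiting times}}$$
Exponential waiting time distribution:
$$P(\{\Delta_t\}|\{x_t\},\lambda) = \prod_x \lambda_{b_x}^{k_x} e^{-\lambda_{b_x}\Delta_x}$$
Bayesian integrated likelihood:
$$P(\{\Delta_t\}|\{x_t\}) = \prod_r \int_0^\infty d\lambda\, \lambda^{e_r} e^{-\lambda\Delta_r} P(\lambda) = \alpha^E \prod_r \frac{(e_r - 1)!}{\Delta_r^{e_r}}$$
Jeffreys prior: $P(\lambda) = \frac{\alpha}{\lambda}$
T. P. P. and Martin Rosvall, arXiv: 1509.04740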
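The per-group factor of the integrated likelihood follows from the Gamma integral ∫ λ^(e−1) e^(−λΔ) dλ = (e−1)!/Δ^e. The sketch below evaluates the logarithm of the product over groups while omitting the α prefactor. It reads e_r as the number of events in group r and Δ_r as the total waiting time accumulated in that group, which is an interpretation of the slide's symbols, and the function name is hypothetical.

```python
import math

def log_waiting_time_evidence(event_counts, total_waits):
    """ln prod_r (e_r - 1)! / Delta_r^{e_r}, omitting the alpha prefactor.

    event_counts[r] is e_r (must be >= 1) and total_waits[r] is Delta_r (> 0).
    """
    return sum(math.lgamma(e) - e * math.log(delta)
               for e, delta in zip(event_counts, total_waits))

# Hypothetical example: three groups with different event counts and waiting times.
print(log_waiting_time_evidence([3, 10, 25], [2.0, 4.5, 30.0]))
```

Because the result is a sum over groups, changing the group membership of a single token only affects two terms, which keeps incremental updates during inference cheap.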
  139. 139. DYNAMIC NETWORKS CONTINUOUS TIME $\{x_\tau\}$ → Sequence of notes in Beethoven’s fifth symphony [Figures: average waiting time (seconds) per memory group, without waiting times (n = 1) and with waiting times (n = 2).]
  140. 140. DYNAMIC NETWORKS NONSTATIONARITY $\{x_t\}$ → Concatenation of “War and peace”, by Leo Tolstoy, and “À la recherche du temps perdu”, by Marcel Proust. Unmodified chain (tokens, memories): $-\log_2 P(\{x_t\}, b) = 7,450,322$
  141. 141. DYNAMIC NETWORKS NONSTATIONARITY $\{x_t\}$ → Concatenation of “War and peace”, by Leo Tolstoy, and “À la recherche du temps perdu”, by Marcel Proust. Unmodified chain (tokens, memories): $-\log_2 P(\{x_t\}, b) = 7,450,322$ Annotated chain, $x_t = (x_t, \text{novel})$ (tokens, memories): $-\log_2 P(\{x_t\}, b) = 7,146,465$
  142. 142. MODEL VARIATIONS: WEIGHTED GRAPHS C. AICHER, A. Z. JACOBS, AND A. CLAUSET. JOURNAL OF COMPLEX NETWORKS 3(2), 221-248 (2015) $A_{ij} \in \{0, 1\}$ → Adjacency matrix $x_{ij} \in \mathbb{R}$ → Real-valued covariates Weights distributed according to the group memberships:
$$P(\{x_{ij}\}|\{b_i\},\{\theta_{rs}\}) = \prod_{i<j} P(x_{ij}|\theta_{b_i b_j})$$
$P(x|\theta)$ → Exponential family (Exponential, Gaussian, Pareto...) Caveats: Shape of weight distribution must be defined beforehand. Parametric method (may overfit). Computationally expensive.
  143. 143. MODEL VARIATIONS: WEIGHTED GRAPHS C. AICHER, A. Z. JACOBS, AND A. CLAUSET. JOURNAL OF COMPLEX NETWORKS 3(2), 221-248 (2015) NFL games, weights: score difference [Figure: (a) SBM (α = 1) (b) SBM Adjacency Matrix (c) WSBM (α = 0) (d) WSBM Adjacency Matrix]
  144. 144. MODEL VARIATIONS: ANNOTATED NETWORKS M.E.J. NEWMAN AND A. CLAUSET, ARXIV:1507.04001 Main idea: Treat metadata as data, not “ground truth”. Annotations are partitions, $\{x_i\}$. Can be used as priors:
$$P(G,\{x_i\}|\theta,\gamma) = \sum_{\{b_i\}} P(G|\{b_i\},\theta)\,P(\{b_i\}|\{x_i\},\gamma)$$
$$P(\{b_i\}|\{x_i\},\gamma) = \prod_i \gamma_{b_i x_i}$$
Drawbacks: Parametric (i.e. can overfit). Annotations are not always partitions.
  145. 145. MODEL VARIATIONS: ANNOTATED NETWORKS M.E.J. NEWMAN AND A. CLAUSET, ARXIV:1507.04001 FIG. 2: Three divisions of a school friendship network, using as metadata (a) school grade, (b) ethnicity, and (c) gender. [Legend: Middle, High; Male, Female; White, Black, Hispanic, Other, Missing.]
  146. 146. MODEL VARIATIONS: ANNOTATED NETWORKS DARKO HRIC, T. P. P., SANTO FORTUNATO, ARXIV:1604.00255 Generalized annotations $A_{ij}$ → Data layer $T_{ij}$ → Annotation layer [Figure: data layer A and metadata layer T.] Joint model for data and metadata. Arbitrary types of annotation. Both data and metadata are clustered into groups. Fully nonparametric.
  147. 147. MODEL VARIATIONS: ANNOTATED NETWORKS
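  148. 148. PREDICTION OF MISSING EDGES
$$G = \underbrace{G}_{\text{Observed}} \cup \underbrace{\delta G}_{\text{Missing}}$$
Bayesian probability of missing edges:
$$P(\delta G|G,\{b_i\}) = \frac{\sum_\theta P(G\cup\delta G|\{b_i\},\theta)\,P(\theta)}{\sum_\theta P(G|\{b_i\},\theta)\,P(\theta)} = \exp(-\Delta\Sigma)$$
R. Guimerà, M Sales-Pardo, PNAS 2009; A. Clauset, C. Moore, MEJ Newman, Nature, 2008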
  149. 149. APPLICATION: DRUG-DRUG INTERACTIONS R GUIMERÀ, M SALES-PARDO, PLOS COMPUT BIOL, 2013
  150. 150. METADATA AND PREDICTION OF MISSING NODES DARKO HRIC, T. P. P., SANTO FORTUNATO, ARXIV:1604.00255
Node probability, with known group membership:
$$P(a_i|A,b_i,b) = \frac{\sum_\theta P(A,a_i|b_i,b,\theta)\,P(\theta)}{\sum_\theta P(A|b,\theta)\,P(\theta)}$$
Node probability, with unknown group membership:
$$P(a_i|A,b) = \sum_{b_i} P(a_i|A,b_i,b)\,P(b_i|b)$$
Node probability, with unknown group membership, but known metadata:
$$P(a_i|A,T,b,c) = \sum_{b_i} P(a_i|A,b_i,b)\,P(b_i|T,b,c)$$
Group membership probability, given metadata:
$$P(b_i|T,b,c) = \frac{P(b_i,b|T,c)}{P(b|T,c)} = \frac{\sum_\gamma P(T|b_i,b,c,\gamma)\,P(b_i,b)\,P(\gamma)}{\sum_{b_i'}\sum_\gamma P(T|b_i',b,c,\gamma)\,P(b_i',b)\,P(\gamma)}$$
Predictive likelihood ratio:
$$\lambda_i = \frac{P(a_i|A,T,b,c)}{P(a_i|A,T,b,c) + P(a_i|A,b)}$$
$\lambda_i > 1/2$ → the metadata improves the prediction task
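In practice the two node probabilities are tiny, so the ratio λ_i is best computed from their logarithms. A minimal sketch, assuming the two log-probabilities have already been obtained from fitted models (the function name and the example values are hypothetical):

```python
import math

def predictive_likelihood_ratio(log_p_with_metadata, log_p_without_metadata):
    """lambda_i = P_with / (P_with + P_without), evaluated stably from logs."""
    d = log_p_without_metadata - log_p_with_metadata
    if d > 0:
        # Rewrite as exp(-d) / (1 + exp(-d)) so the exponential cannot overflow.
        e = math.exp(-d)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(d))

# Hypothetical log-probabilities for one held-out node.
lam = predictive_likelihood_ratio(-1040.2, -1043.9)
print(lam, lam > 0.5)
```

The ratio is a logistic function of the log-likelihood difference, so λ_i = 1/2 corresponds to metadata that neither helps nor hurts.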
  151. 151. METADATA AND PREDICTION OF MISSING NODES DARKO HRIC, T. P. P., SANTO FORTUNATO, ARXIV:1604.00255
  152. 152. METADATA AND PREDICTION OF MISSING NODES DARKO HRIC, T. P. P., SANTO FORTUNATO, ARXIV:1604.00255 [Figure: average likelihood ratio λ for each dataset: FB Penn, PPI (krogan), FB Tennessee, FB Berkeley, FB Caltech, FB Princeton, PPI (isobasehs), PPI (yu), FB Stanford, PGP, PPI (gastric), FB Harvard, FB Vassar, Pol. Blogs, PPI (pancreas), Flickr, PPI (lung), PPI (predicted), Anobii SN, IMDB, Amazon, Debian pkgs., DBLP, Internet AS, APS citations, LFR bench.]
$$\lambda_i = \frac{P(a_i|A,T,b,c)}{P(a_i|A,T,b,c) + P(a_i|A,b)}$$
  153. 153. METADATA AND PREDICTION OF MISSING NODES DARKO HRIC, T. P. P., SANTO FORTUNATO, ARXIV:1604.00255 [Figure: schematic of metadata and data layers that are aligned, misaligned, or random; average predictive likelihood ratio λ versus number of planted groups B, for aligned, random, misaligned, and misaligned (N/B = 10^3) metadata.]
  154. 154. METADATA AND PREDICTION OF MISSING NODES DARKO HRIC, T. P. P., SANTO FORTUNATO, ARXIV:1604.00255
Neighbor probability:
$$P_e(i|j) = k_i\,\frac{e_{b_i b_j}}{e_{b_i} e_{b_j}}$$
Neighbor probability, given metadata tag:
$$P_t(i) = \sum_j P_e(i|j)\,P_m(j|t)$$
Null neighbor probability (no metadata tag):
$$Q(i) = \sum_j P_e(i|j)\,\Pi(j)$$
Kullback-Leibler divergence:
$$D_{\mathrm{KL}}(P_t\|Q) = \sum_i P_t(i)\ln\frac{P_t(i)}{Q(i)}$$
Relative divergence:
$$\mu_r \equiv \frac{D_{\mathrm{KL}}(P_t\|Q)}{H(Q)}$$
→ Metadata group predictiveness
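The relative divergence is straightforward to compute once P_t and Q are available as probability vectors over nodes. The sketch below assumes both are given as plain lists that sum to one and reads the denominator as the Shannon entropy of Q; the distributions used in the example are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i ln(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(q):
    """Shannon entropy H(q) = -sum_i q_i ln q_i, in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def relative_divergence(p_t, q):
    """mu_r = D_KL(P_t || Q) / H(Q)."""
    return kl_divergence(p_t, q) / entropy(q)

# Hypothetical example over four nodes: the tag concentrates neighbors on node 0.
q = [0.25, 0.25, 0.25, 0.25]
p_t = [0.70, 0.10, 0.10, 0.10]
print(relative_divergence(p_t, q))
```

A value near zero means the tag says little about where the neighbors are, while a tag that pins all neighbor probability on one node of a uniform Q reaches a value of one.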
  155. 155. METADATA AND PREDICTION OF MISSING NODES DARKO HRIC, T. P. P., SANTO FORTUNATO, ARXIV:1604.00255 [Figure: metadata group predictiveness µ_r versus metadata group size n_r, for keywords, producers, directors, ratings, country, genre and production year.]
  156. 156. GENERALIZED COMMUNITY STRUCTURE M.E.J. NEWMAN, T. P. PEIXOTO, PHYS. REV. LETT. 115, 088701 (2015) Continuous generative models Nodes inhabit a latent space: $x_u \in [0, 1]$.
$$P(G|c,x) = \prod_{u<v}\omega(x_u,x_v)^{A_{uv}}\,(1 - \omega(x_u,x_v))^{1-A_{uv}}$$
Polynomial expansion:
$$\omega(x,y) = \sum_{ij} c_{ij}\,B_i(x)\,B_j(y)$$
Bernstein basis:
$$B_i(x) = \binom{n}{i} x^i (1-x)^{n-i}$$
Spatial information can be learned from network topology alone. No assortativity/homophily assumed! [Figure: (a) ω(x,y) with Group 1, Group 2, Group 3; (b), (c) networks, with a tail-to-head ordering.]
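The Bernstein expansion of ω(x, y) can be evaluated with a few lines of code. This sketch assumes the coefficients c_ij are given as an (n+1)×(n+1) nested list and that they are chosen so that ω stays within [0, 1]; the coefficient values in the example are made up for illustration.

```python
from math import comb

def bernstein(i, n, x):
    """B_i(x) = C(n, i) x^i (1 - x)^(n - i)."""
    return comb(n, i) * x ** i * (1.0 - x) ** (n - i)

def omega(x, y, c):
    """omega(x, y) = sum_ij c_ij B_i(x) B_j(y), with degree n = len(c) - 1."""
    n = len(c) - 1
    bx = [bernstein(i, n, x) for i in range(n + 1)]
    by = [bernstein(j, n, y) for j in range(n + 1)]
    return sum(c[i][j] * bx[i] * by[j]
               for i in range(n + 1) for j in range(n + 1))

# Hypothetical symmetric coefficients (degree 2): nodes at similar positions
# connect with higher probability than nodes at opposite ends.
c = [[0.9, 0.3, 0.1],
     [0.3, 0.5, 0.3],
     [0.1, 0.3, 0.9]]
print(omega(0.1, 0.15, c), omega(0.1, 0.9, c))
```

Because the Bernstein polynomials are nonnegative and sum to one on [0, 1], ω(x, y) is a weighted average of the coefficients, so keeping every c_ij in [0, 1] is enough to guarantee a valid edge probability.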
  157. 157. OPEN PROBLEMS Knowns Better models... Mechanistic elements... Known Unknowns Quality of fit? Network comparison? Cross-validation and bootstrapping? Combination of micro- and mesoscale structures? Unknown Unknowns ? ?
  158. 158. THE END Very fast, freely available C++ code as part of the graph-tool Python library. http://graph-tool.skewed.de
