Graph distance distribution                  for social network mining                      Paolo Boldi, Marco Rosa, Sebas...
Plan of the talkFriday, September 7, 2012
Plan of the talk               • Computing distances in large graphs (using                   HyperANF)Friday, September 7...
Plan of the talk               • Computing distances in large graphs (using                   HyperANF)               • Ru...
Plan of the talk               • Computing distances in large graphs (using                   HyperANF)               • Ru...
Prelude                            Milgram’s experiment is 45Friday, September 7, 2012
Where it all started...Friday, September 7, 2012
Where it all started...               • M. Kochen, I. de Sola Pool: Contacts and influences.                   (Manuscript,...
Where it all started...               • M. Kochen, I. de Sola Pool: Contacts and influences.                   (Manuscript,...
Where it all started...               • M. Kochen, I. de Sola Pool: Contacts and influences.                   (Manuscript,...
Milgram’s experimentFriday, September 7, 2012
Milgram’s experiment               • 300 people (starting population) are asked to                   dispatch a parcel to ...
Milgram’s experiment               • 300 people (starting population) are asked to                   dispatch a parcel to ...
Milgram’s experiment               • 300 people (starting population) are asked to                   dispatch a parcel to ...
Milgram’s experiment               • 300 people (starting population) are asked to                   dispatch a parcel to ...
Milgram’s experiment               • 300 people (starting population) are asked to                   dispatch a parcel to ...
Milgram’s experiment               • 300 people (starting population) are asked to                   dispatch a parcel to ...
Milgram’s experimentFriday, September 7, 2012
Milgram’s experiment               • Rules of the game:Friday, September 7, 2012
Milgram’s experiment               • Rules of the game:                     • parcels could be directly sent only to someo...
Milgram’s experiment               • Rules of the game:                     • parcels could be directly sent only to someo...
Milgram’s experimentFriday, September 7, 2012
Milgram’s experiment               • Questions Milgram wanted to answer:Friday, September 7, 2012
Milgram’s experiment               • Questions Milgram wanted to answer:                     • How many parcels will reach...
Milgram’s experiment               • Questions Milgram wanted to answer:                     • How many parcels will reach...
Milgram’s experiment               • Questions Milgram wanted to answer:                     • How many parcels will reach...
Milgram’s experimentFriday, September 7, 2012
Milgram’s experiment               • Answers:Friday, September 7, 2012
Milgram’s experiment               • Answers:                     • How many parcels will reach the target? 29%Friday, Sep...
Milgram’s experiment               • Answers:                     • How many parcels will reach the target? 29%           ...
Milgram’s experiment               • Answers:                     • How many parcels will reach the target? 29%           ...
Chain lengthsFriday, September 7, 2012
Milgram’s popularityFriday, September 7, 2012
Milgram’s popularity               • Six degrees of separation slipped away from the                   scientific niche to ...
Milgram’s popularity               • Six degrees of separation slipped away from the                   scientific niche to ...
Milgram’s popularity               • Six degrees of separation slipped away from the                   scientific niche to ...
Milgram’s popularity               • Six degrees of separation slipped away from the                   scientific niche to ...
Milgram’s criticismsFriday, September 7, 2012
Milgram’s criticisms               • “Could it be a big world after all? (The six-                   degrees-of-separation...
Milgram’s criticisms               • “Could it be a big world after all? (The six-                   degrees-of-separation...
Milgram’s criticisms               • “Could it be a big world after all? (The six-                   degrees-of-separation...
Measuring what?Friday, September 7, 2012
Measuring what?               • But what did Milgram’s experiment reveal, after                   all?Friday, September 7,...
Measuring what?               • But what did Milgram’s experiment reveal, after                   all?                    ...
Measuring what?               • But what did Milgram’s experiment reveal, after                   all?                    ...
HyperANF              A tool to compute distances in large graphsFriday, September 7, 2012
IntroductionFriday, September 7, 2012
Introduction             • You want to study the properties of a huge graph                 (typically: a social network)F...
Introduction             • You want to study the properties of a huge graph                 (typically: a social network) ...
Introduction             • You want to study the properties of a huge graph                 (typically: a social network) ...
Graph distances and                               distributionFriday, September 7, 2012
Graph distances and                               distribution            • Given a graph, d(x,y) is the length of the sho...
Graph distances and                               distribution            • Given a graph, d(x,y) is the length of the sho...
Graph distances and                               distribution            • Given a graph, d(x,y) is the length of the sho...
Graph distances and                               distribution            • Given a graph, d(x,y) is the length of the sho...
Exact computationFriday, September 7, 2012
Exact computation             • How can one compute the distance distribution?Friday, September 7, 2012
Exact computation             • How can one compute the distance distribution?                  • Weighted graphs: Dijkstr...
Exact computation             • How can one compute the distance distribution?                  • Weighted graphs: Dijkstr...
Exact computation             • How can one compute the distance distribution?                  • Weighted graphs: Dijkstr...
Exact computation             • How can one compute the distance distribution?                  • Weighted graphs: Dijkstr...
Sampling pairsFriday, September 7, 2012
Sampling pairs               • Sample at random pairs of nodes (x,y)Friday, September 7, 2012
Sampling pairs               • Sample at random pairs of nodes (x,y)               • Compute d(x,y) with a BFS from xFrida...
Sampling pairs               • Sample at random pairs of nodes (x,y)               • Compute d(x,y) with a BFS from x     ...
Sampling pairsFriday, September 7, 2012
Sampling pairs               • For every t, the fraction of sampled pairs that                   were found at distance t ...
Sampling pairs               • For every t, the fraction of sampled pairs that                   were found at distance t ...
Sampling sourcesFriday, September 7, 2012
Sampling sources               • Sample at random a source xFriday, September 7, 2012
Sampling sources               • Sample at random a source x               • Compute a full BFS from xFriday, September 7,...
Sampling sourcesFriday, September 7, 2012
Sampling sources               • It is an unbiased estimator only for undirected and                   connected graphsFri...
Sampling sources               • It is an unbiased estimator only for undirected and                   connected graphs   ...
Sampling sources               • It is an unbiased estimator only for undirected and                   connected graphs   ...
Sampling sources               • It is an unbiased estimator only for undirected and                   connected graphs   ...
Cohen’s samplingFriday, September 7, 2012
Cohen’s sampling               • Edith Cohen [JCSS 1997] came out with a very                   general framework for size...
Alternative: DiffusionFriday, September 7, 2012
Alternative: Diffusion               • Basic idea: Palmer et. al, KDD ’02Friday, September 7, 2012
Alternative: Diffusion               • Basic idea: Palmer et. al, KDD ’02               • Let Bt(x) be the ball of radius t...
Alternative: Diffusion               • Basic idea: Palmer et. al, KDD ’02               • Let Bt(x) be the ball of radius t...
Alternative: Diffusion               • Basic idea: Palmer et. al, KDD ’02               • Let Bt(x) be the ball of radius t...
Alternative: Diffusion               • Basic idea: Palmer et. al, KDD ’02               • Let Bt(x) be the ball of radius t...
Easy but costlyFriday, September 7, 2012
Easy but costly               • Every set requires O(n) bits, hence O(n2) bits                   overallFriday, September ...
Easy but costly               • Every set requires O(n) bits, hence O(n2) bits                   overall               • T...
Easy but costly               • Every set requires O(n) bits, hence O(n2) bits                   overall               • T...
Easy but costly               • Every set requires O(n) bits, hence O(n2) bits                   overall               • T...
Easy but costly               • Every set requires O(n) bits, hence O(n2) bits                   overall               • T...
HyperANFFriday, September 7, 2012
HyperANF                  • We used HyperLogLog counters [Flajolet et al.,                      2007]Friday, September 7, ...
HyperANF                  • We used HyperLogLog counters [Flajolet et al.,                      2007]                  • W...
HyperANF                  • We used HyperLogLog counters [Flajolet et al.,                      2007]                  • W...
Observe thatFriday, September 7, 2012
Observe that               • Every single counter has a guaranteed relative                   standard deviation (dependin...
Observe that               • Every single counter has a guaranteed relative                   standard deviation (dependin...
Observe that               • Every single counter has a guaranteed relative                   standard deviation (dependin...
Other tricksFriday, September 7, 2012
Other tricks             • We use broadword programming to compute efficiently                 unionsFriday, September 7, 2012
Other tricks             • We use broadword programming to compute efficiently                 unions             • Systolic...
Other tricks             • We use broadword programming to compute efficiently                 unions             • Systolic...
Real speed?Friday, September 7, 2012
Real speed?            • Small dimension: 1.8min vs. 4.6h on a graph with                 7.4M nodesFriday, September 7, 2...
Real speed?            • Small dimension: 1.8min vs. 4.6h on a graph with                 7.4M nodes            • Large di...
Try it!Friday, September 7, 2012
Try it!               • HyperANF is available within the webgraph                   packageFriday, September 7, 2012
Try it!               • HyperANF is available within the webgraph                   package               • Download it fr...
Try it!               • HyperANF is available within the webgraph                   package               • Download it fr...
Try it!               • HyperANF is available within the webgraph                   package               • Download it fr...
Running it on Facebook!                      [with Lars Backstrom and Johan Ugander]Friday, September 7, 2012
FacebookFriday, September 7, 2012
Facebook               • Facebook opened up to non-college students on                   September 26, 2006Friday, Septemb...
Facebook               • Facebook opened up to non-college students on                   September 26, 2006               ...
Experiments (time)               • We ran our experiments on snapshots of facebook                            • Jan 1, 200...
Experiments (dataset)               • We considered:                            • -: the whole facebook                   ...
Active users               • We only considered active users (users who have                   done some activity in the 2...
Distance distribution ($)Friday, September 7, 2012
Distance distribution ($)                                                                                fb 2007          ...
Distance distribution ($)                                                                                fb 2008          ...
Distance distribution ($)                                                                                fb 2009          ...
Distance distribution ($)                                                                                fb 2010          ...
Distance distribution ($)                                                                                fb 2011          ...
Distance distribution ($)                                                                            fb current           ...
Distance distribution (it)Friday, September 7, 2012
Distance distribution (it)                                                                                it 2007         ...
Distance distribution (it)                                                                                it 2008         ...
Distance distribution (it)                                                                                it 2009         ...
Distance distribution (it)                                                                                it 2010         ...
Distance distribution (it)                                                                                it 2011         ...
Distance distribution (it)                                                                            it current          ...
Distance distribution (se)Friday, September 7, 2012
Distance distribution (se)                                                                                se 2007         ...
Distance distribution (se)                                                                                se 2008         ...
Distance distribution (se)                                                                                se 2009         ...
Distance distribution (se)                                                                                se 2010         ...
Distance distribution (se)                                                                                se 2011         ...
Distance distribution (se)                                                                                se 2011         ...
Distance distribution (se)                                                                            se current          ...
Average distance                                                                      it                                  ...
Average distance                                                                      it                                  ...
Average distance                                                                      it                                  ...
Effective diameter (@                                                         90%)                                         ...
Harmonic diameter                                                 harmonic diameters                                      ...
Average degree vs. density ($)                                   Avg. degree   Density                            2009    ...
Actual diameter                                     2008     curr                              it      >29         =25    ...
Actual diameter                                            Used the fringe/double-sweep                                   ...
Other applications                    Spid, network robustness and more...Friday, September 7, 2012
What are distances good for?Friday, September 7, 2012
What are distances good for?             • Network models are usually studied on the base of                 the local sta...
What are distances good for?             • Network models are usually studied on the base of                 the local sta...
Global = more informative!Friday, September 7, 2012
An applicationFriday, September 7, 2012
An application               • An application: use the distance distribution as a                   graph digestFriday, Se...
An application               • An application: use the distance distribution as a                   graph digest          ...
Node eliminationFriday, September 7, 2012
Node elimination               • Consider a certain ordering of the vertices of a graphFriday, September 7, 2012
Node elimination               • Consider a certain ordering of the vertices of a graph               • Fix a threshold ϑ,...
Node elimination               • Consider a certain ordering of the vertices of a graph               • Fix a threshold ϑ,...
Experiment                                          0.12                                           0.1                    ...
Experiment (cont.)                            0.9                            0.8                            0.7           ...
Removal strategies                                                    compared                                            ...
Removal in social networks                                                  1                                             ...
FindingsFriday, September 7, 2012
Findings               • Depth-order, PR etc. are strongly disruptive on                   web graphsFriday, September 7, ...
Findings               • Depth-order, PR etc. are strongly disruptive on                   web graphs               • Prop...
Another application: SpidFriday, September 7, 2012
Another application: Spid             • We propose to use spid (shortest-paths index of                 dispersion), the r...
Another application: Spid             • We propose to use spid (shortest-paths index of                 dispersion), the r...
Another application: Spid             • We propose to use spid (shortest-paths index of                 dispersion), the r...
Spid plot                     4                  3.5                     3                  2.5        spid               ...
Spid conjectureFriday, September 7, 2012
Spid conjecture               • We conjecture that spid is able to tell social                   networks from web graphsF...
Spid conjecture               • We conjecture that spid is able to tell social                   networks from web graphs ...
Spid conjecture               • We conjecture that spid is able to tell social                   networks from web graphs ...
Spid conjecture               • We conjecture that spid is able to tell social                   networks from web graphs ...
Spid conjecture               • We conjecture that spid is able to tell social                   networks from web graphs ...
Average distance∝ Effective diameter                                   18                                   16             ...
That’s all, folks!Friday, September 7, 2012
Upcoming SlideShare
Loading in …5
×

Hyper an fandfb-handout

614 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
614
On SlideShare
0
From Embeds
0
Number of Embeds
248
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Hyper an fandfb-handout

  1. 1. Graph distance distribution for social network mining Paolo Boldi, Marco Rosa, Sebastiano Vigna Laboratory for Web Algorithmics Università degli Studi di Milano, Italy Lars Backstrom, Johan Ugander FacebookFriday, September 7, 2012
  2. 2. Plan of the talkFriday, September 7, 2012
  3. 3. Plan of the talk • Computing distances in large graphs (using HyperANF)Friday, September 7, 2012
  4. 4. Plan of the talk • Computing distances in large graphs (using HyperANF) • Running HyperANF on Facebook (the largest Milgram-like experiment ever performed)Friday, September 7, 2012
  5. 5. Plan of the talk • Computing distances in large graphs (using HyperANF) • Running HyperANF on Facebook (the largest Milgram-like experiment ever performed) • Other uses of distances (in particular: robustness)Friday, September 7, 2012
  6. 6. Prelude Milgram’s experiment is 45Friday, September 7, 2012
  7. 7. Where it all started...Friday, September 7, 2012
  8. 8. Where it all started... • M. Kochen, I. de Sola Pool: Contacts and influences. (Manuscript, early 50s)Friday, September 7, 2012
  9. 9. Where it all started... • M. Kochen, I. de Sola Pool: Contacts and influences. (Manuscript, early 50s) • A. Rapoport, W.J. Horvath: A study of a large sociogram. (Behav.Sci. 1961)Friday, September 7, 2012
  10. 10. Where it all started... • M. Kochen, I. de Sola Pool: Contacts and influences. (Manuscript, early 50s) • A. Rapoport, W.J. Horvath: A study of a large sociogram. (Behav.Sci. 1961) • S. Milgram, An experimental study of the sma$ world problem. (Sociometry, 1969)Friday, September 7, 2012
  11. 11. Milgram’s experimentFriday, September 7, 2012
  12. 12. Milgram’s experiment • 300 people (starting population) are asked to dispatch a parcel to a single individual (target)Friday, September 7, 2012
  13. 13. Milgram’s experiment • 300 people (starting population) are asked to dispatch a parcel to a single individual (target) • The target was a Boston stockbrokerFriday, September 7, 2012
  14. 14. Milgram’s experiment • 300 people (starting population) are asked to dispatch a parcel to a single individual (target) • The target was a Boston stockbroker • The starting population is selected as follows:Friday, September 7, 2012
  15. 15. Milgram’s experiment • 300 people (starting population) are asked to dispatch a parcel to a single individual (target) • The target was a Boston stockbroker • The starting population is selected as follows: • 100 were random Boston inhabitants (group A)Friday, September 7, 2012
  16. 16. Milgram’s experiment • 300 people (starting population) are asked to dispatch a parcel to a single individual (target) • The target was a Boston stockbroker • The starting population is selected as follows: • 100 were random Boston inhabitants (group A) • 100 were random Nebraska strockbrokers (group B)Friday, September 7, 2012
  17. 17. Milgram’s experiment • 300 people (starting population) are asked to dispatch a parcel to a single individual (target) • The target was a Boston stockbroker • The starting population is selected as follows: • 100 were random Boston inhabitants (group A) • 100 were random Nebraska strockbrokers (group B) • 100 were random Nebraska inhabitants (group C)Friday, September 7, 2012
  18. 18. Milgram’s experimentFriday, September 7, 2012
  19. 19. Milgram’s experiment • Rules of the game:Friday, September 7, 2012
  20. 20. Milgram’s experiment • Rules of the game: • parcels could be directly sent only to someone the sender knows personallyFriday, September 7, 2012
  21. 21. Milgram’s experiment • Rules of the game: • parcels could be directly sent only to someone the sender knows personally • 453 intermediaries happened to be involved in the experiments (besides the starting population and the target)Friday, September 7, 2012
  22. 22. Milgram’s experimentFriday, September 7, 2012
  23. 23. Milgram’s experiment • Questions Milgram wanted to answer:Friday, September 7, 2012
  24. 24. Milgram’s experiment • Questions Milgram wanted to answer: • How many parcels will reach the target?Friday, September 7, 2012
  25. 25. Milgram’s experiment • Questions Milgram wanted to answer: • How many parcels will reach the target? • What is the distribution of the number of hops required to reach the target?Friday, September 7, 2012
  26. 26. Milgram’s experiment • Questions Milgram wanted to answer: • How many parcels will reach the target? • What is the distribution of the number of hops required to reach the target? • Is this distribution different for the three starting subpopulations?Friday, September 7, 2012
  27. 27. Milgram’s experimentFriday, September 7, 2012
  28. 28. Milgram’s experiment • Answers:Friday, September 7, 2012
  29. 29. Milgram’s experiment • Answers: • How many parcels will reach the target? 29%Friday, September 7, 2012
  30. 30. Milgram’s experiment • Answers: • How many parcels will reach the target? 29% • What is the distribution of the number of hops required to reach the target? Avg. was 5.2Friday, September 7, 2012
  31. 31. Milgram’s experiment • Answers: • How many parcels will reach the target? 29% • What is the distribution of the number of hops required to reach the target? Avg. was 5.2 • Is this distribution different for the three starting subpopulations? Yes: avg. for groups A/B/C was 4.6/5.4/5.7, respectivelyFriday, September 7, 2012
  32. 32. Chain lengthsFriday, September 7, 2012
  33. 33. Milgram’s popularityFriday, September 7, 2012
  34. 34. Milgram’s popularity • Six degrees of separation slipped away from the scientific niche to enter the world of popular immagination:Friday, September 7, 2012
  35. 35. Milgram’s popularity • Six degrees of separation slipped away from the scientific niche to enter the world of popular immagination: • “Six degrees of separation” is a play by John Guare...Friday, September 7, 2012
  36. 36. Milgram’s popularity • Six degrees of separation slipped away from the scientific niche to enter the world of popular immagination: • “Six degrees of separation” is a play by John Guare... • ...a movie by Fred Schepisi...Friday, September 7, 2012
  37. 37. Milgram’s popularity • Six degrees of separation slipped away from the scientific niche to enter the world of popular immagination: • “Six degrees of separation” is a play by John Guare... • ...a movie by Fred Schepisi... • ...a song sung by dolls in their national costume at Disneyland in a heart-warming exhibition celebrating the connectedness of people allFriday, September 7, 2012
  38. 38. Milgram’s criticismsFriday, September 7, 2012
  39. 39. Milgram’s criticisms • “Could it be a big world after all? (The six- degrees-of-separation myth)” (Judith S. Kleinfeld, 2002)Friday, September 7, 2012
  40. 40. Milgram’s criticisms • “Could it be a big world after all? (The six- degrees-of-separation myth)” (Judith S. Kleinfeld, 2002) • The vast majority of chains were never completedFriday, September 7, 2012
  41. 41. Milgram’s criticisms • “Could it be a big world after all? (The six- degrees-of-separation myth)” (Judith S. Kleinfeld, 2002) • The vast majority of chains were never completed • Extremely difficult to reproduceFriday, September 7, 2012
  42. 42. Measuring what?Friday, September 7, 2012
  43. 43. Measuring what? • But what did Milgram’s experiment reveal, after all?Friday, September 7, 2012
  44. 44. Measuring what? • But what did Milgram’s experiment reveal, after all? i) That the world is smallFriday, September 7, 2012
  45. 45. Measuring what? • But what did Milgram’s experiment reveal, after all? i) That the world is small ii)That people are able to exploit this smallnessFriday, September 7, 2012
  46. 46. HyperANF A tool to compute distances in large graphsFriday, September 7, 2012
  47. 47. IntroductionFriday, September 7, 2012
  48. 48. Introduction • You want to study the properties of a huge graph (typically: a social network)Friday, September 7, 2012
  49. 49. Introduction • You want to study the properties of a huge graph (typically: a social network) • You want to obtain some information about its global structure (not simply triangle-counting/degree distribution/etc.)Friday, September 7, 2012
  50. 50. Introduction • You want to study the properties of a huge graph (typically: a social network) • You want to obtain some information about its global structure (not simply triangle-counting/degree distribution/etc.) • A natural candidate: distance distributionFriday, September 7, 2012
  51. 51. Graph distances and distributionFriday, September 7, 2012
  52. 52. Graph distances and distribution • Given a graph, d(x,y) is the length of the shortest path from x to y (∞ if one cannot go from x to y)Friday, September 7, 2012
  53. 53. Graph distances and distribution • Given a graph, d(x,y) is the length of the shortest path from x to y (∞ if one cannot go from x to y) • For undirected graphs, d(x,y)=d(y,x)Friday, September 7, 2012
  54. 54. Graph distances and distribution • Given a graph, d(x,y) is the length of the shortest path from x to y (∞ if one cannot go from x to y) • For undirected graphs, d(x,y)=d(y,x) • For every t, count the number of pairs (x,y) such that d(x,y)=tFriday, September 7, 2012
  55. 55. Graph distances and distribution • Given a graph, d(x,y) is the length of the shortest path from x to y (∞ if one cannot go from x to y) • For undirected graphs, d(x,y)=d(y,x) • For every t, count the number of pairs (x,y) such that d(x,y)=t • The fraction of pairs at distance t is (the density function of) a distributionFriday, September 7, 2012
  56. 56. Exact computationFriday, September 7, 2012
  57. 57. Exact computation • How can one compute the distance distribution?Friday, September 7, 2012
  58. 58. Exact computation • How can one compute the distance distribution? • Weighted graphs: Dijkstra (single-source: O(n2)), Floyd-Warshall (all-pairs: O(n3))Friday, September 7, 2012
  59. 59. Exact computation • How can one compute the distance distribution? • Weighted graphs: Dijkstra (single-source: O(n2)), Floyd-Warshall (all-pairs: O(n3)) • In the unweighted case:Friday, September 7, 2012
  60. 60. Exact computation • How can one compute the distance distribution? • Weighted graphs: Dijkstra (single-source: O(n2)), Floyd-Warshall (all-pairs: O(n3)) • In the unweighted case: • a single BFS solves the single-source version of the problem: O(m)Friday, September 7, 2012
  61. 61. Exact computation • How can one compute the distance distribution? • Weighted graphs: Dijkstra (single-source: O(n2)), Floyd-Warshall (all-pairs: O(n3)) • In the unweighted case: • a single BFS solves the single-source version of the problem: O(m) • if we repeat it from every source: O(nm)Friday, September 7, 2012
  62. 62. Sampling pairsFriday, September 7, 2012
  63. 63. Sampling pairs • Sample at random pairs of nodes (x,y)Friday, September 7, 2012
  64. 64. Sampling pairs • Sample at random pairs of nodes (x,y) • Compute d(x,y) with a BFS from xFriday, September 7, 2012
  65. 65. Sampling pairs • Sample at random pairs of nodes (x,y) • Compute d(x,y) with a BFS from x • (Possibly: reject the pair if d(x,y) is infinite)Friday, September 7, 2012
  66. 66. Sampling pairsFriday, September 7, 2012
  67. 67. Sampling pairs • For every t, the fraction of sampled pairs that were found at distance t are an estimator of the value of the probability mass functionFriday, September 7, 2012
  68. 68. Sampling pairs • For every t, the fraction of sampled pairs that were found at distance t are an estimator of the value of the probability mass function • Takes a BFS for every pair O(m)Friday, September 7, 2012
  69. 69. Sampling sourcesFriday, September 7, 2012
  70. 70. Sampling sources • Sample at random a source xFriday, September 7, 2012
  71. 71. Sampling sources • Sample at random a source x • Compute a full BFS from xFriday, September 7, 2012
  72. 72. Sampling sourcesFriday, September 7, 2012
  73. 73. Sampling sources • It is an unbiased estimator only for undirected and connected graphsFriday, September 7, 2012
  74. 74. Sampling sources • It is an unbiased estimator only for undirected and connected graphs • Uses anyway BFS...Friday, September 7, 2012
  75. 75. Sampling sources • It is an unbiased estimator only for undirected and connected graphs • Uses anyway BFS... • ...not cache friendlyFriday, September 7, 2012
  76. 76. Sampling sources • It is an unbiased estimator only for undirected and connected graphs • Uses anyway BFS... • ...not cache friendly • ...not compression friendlyFriday, September 7, 2012
  77. 77. Cohen’s samplingFriday, September 7, 2012
  78. 78. Cohen’s sampling • Edith Cohen [JCSS 1997] came out with a very general framework for size estimation: powerful, but doesn’t scale well, it is not easily parallelizable, requires direct accessFriday, September 7, 2012
  79. 79. Alternative: DiffusionFriday, September 7, 2012
  80. 80. Alternative: Diffusion • Basic idea: Palmer et. al, KDD ’02Friday, September 7, 2012
  81. 81. Alternative: Diffusion • Basic idea: Palmer et. al, KDD ’02 • Let Bt(x) be the ball of radius t about x (the set of nodes at distance ≤t from x)Friday, September 7, 2012
  82. 82. Alternative: Diffusion • Basic idea: Palmer et. al, KDD ’02 • Let Bt(x) be the ball of radius t about x (the set of nodes at distance ≤t from x) • Clearly B0(x)={x}Friday, September 7, 2012
  83. 83. Alternative: Diffusion • Basic idea: Palmer et. al, KDD ’02 • Let Bt(x) be the ball of radius t about x (the set of nodes at distance ≤t from x) • Clearly B0(x)={x} • Moreover Bt+1(x)=∪x→yBt(y)∪{x}Friday, September 7, 2012
  84. 84. Alternative: Diffusion • Basic idea: Palmer et. al, KDD ’02 • Let Bt(x) be the ball of radius t about x (the set of nodes at distance ≤t from x) • Clearly B0(x)={x} • Moreover Bt+1(x)=∪x→yBt(y)∪{x} • So computing Bt+1 starting from Bt one just need a single (sequential) scan of the graphFriday, September 7, 2012
  85. 85. Easy but costlyFriday, September 7, 2012
  86. 86. Easy but costly • Every set requires O(n) bits, hence O(n2) bits overallFriday, September 7, 2012
  87. 87. Easy but costly • Every set requires O(n) bits, hence O(n2) bits overall • Too many!Friday, September 7, 2012
  88. 88. Easy but costly • Every set requires O(n) bits, hence O(n2) bits overall • Too many! • What about using approximated sets?Friday, September 7, 2012
  89. 89. Easy but costly • Every set requires O(n) bits, hence O(n2) bits overall • Too many! • What about using approximated sets? • We need probabilistic counters, with just two primitives: add and size?Friday, September 7, 2012
  90. 90. Easy but costly • Every set requires O(n) bits, hence O(n2) bits overall • Too many! • What about using approximated sets? • We need probabilistic counters, with just two primitives: add and size? • Very small!Friday, September 7, 2012
  91. 91. HyperANFFriday, September 7, 2012
  92. 92. HyperANF • We used HyperLogLog counters [Flajolet et al., 2007]Friday, September 7, 2012
  93. 93. HyperANF • We used HyperLogLog counters [Flajolet et al., 2007] • With 40 bits you can count up to 4 billion with a standard deviation of 6%Friday, September 7, 2012
  94. 94. HyperANF • We used HyperLogLog counters [Flajolet et al., 2007] • With 40 bits you can count up to 4 billion with a standard deviation of 6% • Remember: one set per node!Friday, September 7, 2012
  95. 95. Observe thatFriday, September 7, 2012
  96. 96. Observe that • Every single counter has a guaranteed relative standard deviation (depending only on the number of registers per counter)Friday, September 7, 2012
  97. 97. Observe that • Every single counter has a guaranteed relative standard deviation (depending only on the number of registers per counter) • This implies a guarantee on the summation of the countersFriday, September 7, 2012
  98. 98. Observe that • Every single counter has a guaranteed relative standard deviation (depending only on the number of registers per counter) • This implies a guarantee on the summation of the counters • This gives in turn precision bounds on the estimated distribution with respect to the real oneFriday, September 7, 2012
  99. 99. Other tricksFriday, September 7, 2012
  100. 100. Other tricks • We use broadword programming to compute efficiently unionsFriday, September 7, 2012
  101. 101. Other tricks • We use broadword programming to compute efficiently unions • Systolic computation for on-demand updates of countersFriday, September 7, 2012
  102. 102. Other tricks • We use broadword programming to compute efficiently unions • Systolic computation for on-demand updates of counters • Exploited micropara$elization of multicore architecturesFriday, September 7, 2012
  103. 103. Real speed?Friday, September 7, 2012
  104. 104. Real speed? • Small dimension: 1.8min vs. 4.6h on a graph with 7.4M nodesFriday, September 7, 2012
  105. 105. Real speed? • Small dimension: 1.8min vs. 4.6h on a graph with 7.4M nodes • Large dimension: HADI [Kang et al., 2010] is a Hadoop-conscious implementation of ANF. Takes 30 minutes on a 200K-node graph (on one of the 50 world largest supercomputers). HyperANF does the same in 2.25min on our workstation (20 min on this laptop).Friday, September 7, 2012
  106. 106. Try it!Friday, September 7, 2012
  107. 107. Try it! • HyperANF is available within the webgraph packageFriday, September 7, 2012
  108. 108. Try it! • HyperANF is available within the webgraph package • Download it fromFriday, September 7, 2012
  109. 109. Try it! • HyperANF is available within the webgraph package • Download it from • http://webgraph.law.dsi.unimi.it/Friday, September 7, 2012
  110. 110. Try it! • HyperANF is available within the webgraph package • Download it from • http://webgraph.law.dsi.unimi.it/ • Or google for webgraphFriday, September 7, 2012
  111. 111. Running it on Facebook! [with Lars Backstrom and Johan Ugander]Friday, September 7, 2012
  112. 112. FacebookFriday, September 7, 2012
  113. 113. Facebook • Facebook opened up to non-college students on September 26, 2006Friday, September 7, 2012
  114. 114. Facebook • Facebook opened up to non-college students on September 26, 2006 • So, between 1 Jan 2007 and 1 Jan 2008 the number of users explodedFriday, September 7, 2012
  115. 115. Experiments (time) • We ran our experiments on snapshots of facebook • Jan 1, 2007 • Jan 1, 2008 ... • Jan 1, 2011 • [current] May, 2011Friday, September 7, 2012
  116. 116. Experiments (dataset) • We considered: • -: the whole facebook • it / se: only Italian / Swedish users • it+se: only Italian & Swedish users • us: only US users • Based on users’ current geo-IP locationFriday, September 7, 2012
  117. 117. Active users • We only considered active users (users who have done some activity in the 28 days preceding 9 Jun 2011) • So we are not considering “old” users that are not active any more • For - [current] we have about 750M nodesFriday, September 7, 2012
  118. 118. Distance distribution ($)Friday, September 7, 2012
  119. 119. Distance distribution ($) fb 2007 0.7 0.6 0.5 ● 0.4 % pairs ● 0.3 0.2 0.1 ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  120. 120. Distance distribution ($) fb 2008 0.7 0.6 0.5 ● 0.4 % pairs 0.3 ● 0.2 ● 0.1 ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  121. 121. Distance distribution ($) fb 2009 0.7 0.6 0.5 ● 0.4 % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  122. 122. Distance distribution ($) fb 2010 0.7 0.6 0.5 0.4 ● % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  123. 123. Distance distribution ($) fb 2011 0.7 0.6 0.5 0.4 ● % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  124. 124. Distance distribution ($) fb current 0.7 0.6 ● 0.5 0.4 % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  125. 125. Distance distribution (it)Friday, September 7, 2012
  126. 126. Distance distribution (it) it 2007 0.7 0.6 0.5 0.4 % pairs 0.3 0.2 0.1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 ● 0 5 10 15 distanceFriday, September 7, 2012
  127. 127. Distance distribution (it) it 2008 0.7 0.6 0.5 0.4 % pairs 0.3 ● ● 0.2 ● 0.1 ● ● ● ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  128. 128. Distance distribution (it) it 2009 0.7 0.6 0.5 ● 0.4 ● % pairs 0.3 0.2 0.1 ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  129. 129. Distance distribution (it) it 2010 0.7 ● 0.6 0.5 0.4 % pairs 0.3 0.2 ● ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  130. 130. Distance distribution (it) it 2011 0.7 ● 0.6 0.5 0.4 % pairs 0.3 ● 0.2 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  131. 131. Distance distribution (it) it current 0.7 ● 0.6 0.5 0.4 % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  132. 132. Distance distribution (se)Friday, September 7, 2012
  133. 133. Distance distribution (se) se 2007 0.7 0.6 0.5 0.4 % pairs 0.3 0.2 ● ● ● ● 0.1 ● ● ● ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  134. 134. Distance distribution (se) se 2008 0.7 0.6 ● 0.5 0.4 % pairs ● 0.3 0.2 0.1 ● ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  135. 135. Distance distribution (se) se 2009 0.7 0.6 ● 0.5 0.4 % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  136. 136. Distance distribution (se) se 2010 0.7 0.6 ● 0.5 0.4 % pairs 0.3 0.2 ● ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  137. 137. Distance distribution (se) se 2011 0.7 ● 0.6 0.5 0.4 % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  138. 138. Distance distribution (se) se 2011 0.7 ● 0.6 0.5 0.4 % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  139. 139. Distance distribution (se) se current 0.7 0.6 0.5 0.4 ● % pairs 0.3 ● 0.2 ● 0.1 ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 distanceFriday, September 7, 2012
  140. 140. Average distance it se ● ● 2008 curr 4.0 4.5 5.0 5.5 avg. distance avg. distance 4 5 6 7 8 9 ● ● ● ● ● 6.58 3.90 ● ● ● ● ● 2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current it year year 4.33 3.89 itse us ● ● ● se 9 4.7 avg. distance avg. distance 8 ● 7 4.5 6 4.9 4.16 ● it+se 5 ● ● ● ● 4.3 ● ● ● 4 2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current year year ● ● fb us 4.74 4.32 4.6 4.8 5.0 5.2 avg. distance ● ● ● ● $ 5.28 4.74 2007 2008 2009 2010 2011 current yearFriday, September 7, 2012
  141. 141. Average distance it se ● ● 2008 curr 4.0 4.5 5.0 5.5 avg. distance avg. distance 4 5 6 7 8 9 ● ● ● ● ● 6.58 3.90 ● ● ● ● ● 2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current it year year 4.33 3.89 itse us ● ● ● se 9 4.7 avg. distance avg. distance 8 ● 7 4.5 6 4.9 4.16 ● it+se 5 ● ● ● ● 4.3 ● ● ● 4 2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current year year ● ● fb us 4.74 4.32 4.6 4.8 5.0 5.2 avg. distance ● ● ● ● $ 5.28 4.74 2007 2008 2009 2010 2011 current yearFriday, September 7, 2012
  142. 142. Average distance it se ● ● 2008 curr 4.0 4.5 5.0 5.5 avg. distance avg. distance 4 5 6 7 8 9 ● ● ● ● ● 6.58 3.90 ● ● ● ● ● 2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current it year year 4.33 3.89 itse us ● ● ● se 9 4.7 avg. distance avg. distance 8 ● 7 4.5 6 4.9 4.16 ● it+se 5 ● ● ● ● 4.3 ● ● ● 4 2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current year year ● ● fb us 4.74 4.32 4.6 4.8 5.0 5.2 avg. distance ● ● ● ● $ 5.28 4.74 2007 2008 2009 2010 2011 current year $ (current): 92% pairs are reachable!Friday, September 7, 2012
  143. 143. Effective diameter (@ 90%) effective diameters ● 2008 curr 8 ● ● ● ● ● ● it 9.0 5.2 ● ● ● ● 6 ● ● ● 5.9 5.3 ● ● ● se effective diam. ● ● 4 it * * * it se itse us +se 6.8 5.8 * * fb 2 us 6.5 5.8 0 2008 2009 2010 2011 $ 7.0 6.2 yearFriday, September 7, 2012
  144. 144. Harmonic diameter harmonic diameters 2008 curr 30 ● 23.7 3.4 25 it * it 20 * se * itse se 4.5 4.0 harmonic diam. * us * fb 15 it +se 5.8 3.8 10 ● ● ● ● ● ● us 4.6 4.4 5 ● ● ● ● ● ● ● ● ● ● ● 0 2008 2009 2010 2011 $ 5.7 4.6 yearFriday, September 7, 2012
  145. 145. Average degree vs. density ($) Avg. degree Density 2009 88.7 6.4 * 10-7 2010 113.0 3.4 * 10-7 2011 169.0 3.0 * 10-7 curr 190.4 2.6 * 10-7Friday, September 7, 2012
  146. 146. Actual diameter 2008 curr it >29 =25 se >16 =25 it+se >21 =27 us >17 =30 # >16 >58Friday, September 7, 2012
  147. 147. Actual diameter Used the fringe/double-sweep technique for “=” 2008 curr it >29 =25 se >16 =25 it+se >21 =27 us >17 =30 # >16 >58Friday, September 7, 2012
  148. 148. Other applications Spid, network robustness and more...Friday, September 7, 2012
  149. 149. What are distances good for?Friday, September 7, 2012
  150. 150. What are distances good for? • Network models are usually studied on the base of the local statistics they produceFriday, September 7, 2012
  151. 151. What are distances good for? • Network models are usually studied on the base of the local statistics they produce • Not difficult to obtain models that behave correctly locally (i.e., as far as degree distribution, assortativity, clustering coefficients... are concerned)Friday, September 7, 2012
  152. 152. Global = more informative!Friday, September 7, 2012
  153. 153. An applicationFriday, September 7, 2012
  154. 154. An application • An application: use the distance distribution as a graph digestFriday, September 7, 2012
  155. 155. An application • An application: use the distance distribution as a graph digest • Typical example: if I modify the graph with a certain criterion, how much does the distance distribution change?Friday, September 7, 2012
  156. 156. Node eliminationFriday, September 7, 2012
  157. 157. Node elimination • Consider a certain ordering of the vertices of a graphFriday, September 7, 2012
  158. 158. Node elimination • Consider a certain ordering of the vertices of a graph • Fix a threshold ϑ, delete all vertices (and all incident arcs) in the specified order, until ϑm arcs have been deletedFriday, September 7, 2012
  159. 159. Node elimination • Consider a certain ordering of the vertices of a graph • Fix a threshold ϑ, delete all vertices (and all incident arcs) in the specified order, until ϑm arcs have been deleted • Compute the “difference” between the graph you obtained and the original oneFriday, September 7, 2012
  160. 160. Experiment 0.12 0.1 0.08 probability 0.06 0.04 0.02 0 1 10 100 length 0.00 0.05 0.15 0.30 0.01 0.10 0.20 Deleting nodes in order of (syntactic) depthFriday, September 7, 2012
  161. 161. Experiment (cont.) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.05 0.1 0.15 0.2 0.25 0.3 θ Kullback-Leibler L1 δ-average distance L2 Distribution divergence (various measures)Friday, September 7, 2012
  162. 162. Removal strategies compared 1 0.8 δ-average distance 0.6 0.4 0.2 0 0 0.05 0.1 0.15 0.2 0.25 0.3 θ random PR near-root degree LPFriday, September 7, 2012
  163. 163. Removal in social networks 1 0.8 δ-average distance 0.6 0.4 0.2 0 0 0.05 0.1 0.15 0.2 0.25 0.3 θ random degree PR LPFriday, September 7, 2012
  164. 164. FindingsFriday, September 7, 2012
  165. 165. Findings • Depth-order, PR etc. are strongly disruptive on web graphsFriday, September 7, 2012
  166. 166. Findings • Depth-order, PR etc. are strongly disruptive on web graphs • Proper social networks are much more robust, still being similar to web graphs under many respectsFriday, September 7, 2012
  167. 167. Another application: SpidFriday, September 7, 2012
  168. 168. Another application: Spid • We propose to use spid (shortest-paths index of dispersion), the ratio between variance and average in the distance distributionFriday, September 7, 2012
  169. 169. Another application: Spid • We propose to use spid (shortest-paths index of dispersion), the ratio between variance and average in the distance distribution • When the dispersion index is <1, the distribution is subdispersed; >1, is superdispersedFriday, September 7, 2012
  170. 170. Another application: Spid • We propose to use spid (shortest-paths index of dispersion), the ratio between variance and average in the distance distribution • When the dispersion index is <1, the distribution is subdispersed; >1, is superdispersed • Web graphs and social networks are different under this viewpoint!Friday, September 7, 2012
  171. 171. Spid plot 4 3.5 3 2.5 spid 2 1.5 1 0.5 0 10000 100000 1e+06 1e+07 1e+08 1e+09 1e+10 sizeFriday, September 7, 2012
  172. 172. Spid conjectureFriday, September 7, 2012
  173. 173. Spid conjecture • We conjecture that spid is able to tell social networks from web graphsFriday, September 7, 2012
  174. 174. Spid conjecture • We conjecture that spid is able to tell social networks from web graphs • Averag distance alone would not suffice: it is very changeable and depends on the scaleFriday, September 7, 2012
  175. 175. Spid conjecture • We conjecture that spid is able to tell social networks from web graphs • Averag distance alone would not suffice: it is very changeable and depends on the scale • Spid, instead, seems to have a clear cutpoint at 1Friday, September 7, 2012
  176. 176. Spid conjecture • We conjecture that spid is able to tell social networks from web graphs • Averag distance alone would not suffice: it is very changeable and depends on the scale • Spid, instead, seems to have a clear cutpoint at 1 • What is Facebook spid?Friday, September 7, 2012
  177. 177. Spid conjecture • We conjecture that spid is able to tell social networks from web graphs • Averag distance alone would not suffice: it is very changeable and depends on the scale • Spid, instead, seems to have a clear cutpoint at 1 • What is Facebook spid? [Answer: 0.093]Friday, September 7, 2012
  178. 178. Average distance∝ Effective diameter 18 16 14 average distance 12 10 8 6 4 2 0 5 10 15 20 25 30 interpolated effective diameterFriday, September 7, 2012
  179. 179. That’s all, folks!Friday, September 7, 2012

×