
Sampling methods for graphs

Talk given at 3rd ISNPS Conference in Avignon, June 2016 in Ricardo Cao's invited session.

Published in: Science

  1. Statistics and networks: motivations and methods; Survey sampling; Extending the sampling design; Results and future work. An overview of sampling methods for graphs - Application to Twitter. Antoine Rebecq, Université Paris X - INSEE, 6/15/16. Antoine Rebecq, Sampling the Twitter graph.
  2. Outline. 1: Statistics and networks: motivations and methods (Graphs and stats; Methods). 2: Survey sampling (Estimates; Sampling design). 3: Extending the sampling design (Snowball sampling; Adaptive sampling). 4: Results and future work (Results; Sample size; Future work).
  3. Section 1: Statistics and networks: motivations and methods
  4. Subsection 1: Graphs and stats
  5. Examples of statistics on graphs. Official statistics: measuring “hidden populations”.
  6. Examples of statistics on graphs. Rise of “big graphs”.
  7. Statistics of interest: degree, centrality, clustering, communities, ...
  8. Subsection 2: Methods
  9. Methods for graph statistics: algorithms (computer science, “big data”); model-based estimation; sampling (“design-based estimation”).
  10. Methods for graph statistics
  11. Computer science methods: efficient algorithms (speed / memory).
  12. Big data begets big graph. Twitter in 2013. Image from [1].
  13. Computer science methods: efficient algorithms (speed / memory). Sometimes require sampling.
  14. Model-based estimation. Famous graph models: Erdős-Rényi; Price / Barabási-Albert (heavy-tailed degree distribution); Watts-Strogatz / “small-world” (short path lengths); stochastic block models (communities). Images from [9].
  15. Model-based estimation: Erdős-Rényi (“random graphs”)
  16. Model-based estimation: Barabási-Albert (“preferential attachment”)
  17. Model-based estimation: Watts-Strogatz (“small world”)
  18. Model-based estimation: stochastic block models
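As a rough illustration of how two of these models differ, the sketch below generates an Erdős-Rényi graph and a preferential-attachment (Barabási-Albert style) graph using only the Python standard library, then compares their largest degrees. The function names, graph sizes and seed are illustrative choices that do not come from the talk, and the talk's own images were produced with graph-tool ([9]), not with this code.

```python
import random
from collections import Counter
from itertools import combinations


def erdos_renyi(n, p, rng):
    """G(n, p): each of the n*(n-1)/2 possible edges is kept with probability p."""
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]


def preferential_attachment(n, m, rng):
    """Each new vertex links to m distinct existing vertices, picked with
    probability proportional to their current degree."""
    edges = list(combinations(range(m + 1), 2))      # seed graph: clique on m + 1 vertices
    endpoints = [v for edge in edges for v in edge]  # vertex v appears deg(v) times
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


def max_degree(edges):
    """Largest vertex degree in an undirected edge list."""
    degree = Counter(v for edge in edges for v in edge)
    return max(degree.values())


rng = random.Random(0)                   # fixed seed, for reproducibility only
n = 1000
er = erdos_renyi(n, 4 / n, rng)          # mean degree close to 4
pa = preferential_attachment(n, 2, rng)  # mean degree close to 4 as well
print(max_degree(er), max_degree(pa))    # the second is typically much larger
```

Both graphs have a similar mean degree, but preferential attachment concentrates edges on a few hubs, which is the heavy-tailed behaviour the slides attribute to the Price / Barabási-Albert model.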
  19. Example: Star Wars: The Force Awakens
  20. Example: Star Wars: The Force Awakens. How many (real) users are behind these tweets?
  21. Example: “Star Wars, The Force Awakens”. Let's write: $y_k$ = number of tweets @starwars by user $k$ between 7:48 and 10:48 PM EST on 10/29/15; $z_k = \mathbb{1}\{y_k \geq 1\}$. Goal: estimate $N_C = T(Z)$. Additionally, we write: $n_C = \sum_{k \in s} z_k$.
  22. The Twitter graph ([6]): it is directed; its degree distribution is heavy-tailed.
  23. The Twitter graph has small path lengths.
  24. Sampling / design-based estimation. Sampling: select a few vertices/edges and compute estimators using sample data. Very little exists about design-based statistical inference on networks (Kolaczyk 2009, [5]). We try survey sampling methods used in official statistics institutes to make design-based inference about “big graphs”.
  25. Section 2: Survey sampling
  26. Subsection 1: Estimates
  27. Horvitz-Thompson estimator. Population $U$: vertices of the Twitter graph. Assign each $k \in U$ an inclusion probability $P(k \in s) = \pi_k$.
  28. Horvitz-Thompson estimator. Classic unbiased estimator for totals and means (Horvitz-Thompson): $\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$ and $\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$.
  29. Horvitz-Thompson estimator. The variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities $\pi_k = P(k \in s)$ and $\pi_{kl} = P(k, l \in s)$: $V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$.
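The Horvitz-Thompson total and its variance can be sketched in a few lines. The example assumes Poisson sampling (independent inclusions), in which case $\pi_{kl} = \pi_k \pi_l$ for $k \neq l$ and $\pi_{kk} = \pi_k$, so the double sum reduces to $\sum_{k \in U} (1/\pi_k - 1) y_k^2$. The toy population, the probabilities and the function names are made up for illustration.

```python
def horvitz_thompson_total(sample, y, pi):
    """HT estimate of the total of y: sum over sampled k of y[k] / pi[k]."""
    return sum(y[k] / pi[k] for k in sample)


def poisson_ht_variance(y, pi):
    """True HT variance when units are included independently (Poisson sampling).

    With pi_kl = pi_k * pi_l for k != l, only the diagonal of the double sum
    survives: sum over the population of (1 / pi_k - 1) * y_k ** 2.
    """
    return sum((1.0 / pi[k] - 1.0) * y[k] ** 2 for k in y)


# Toy population: tweet counts y_k and inclusion probabilities pi_k (hypothetical).
y = {"a": 0, "b": 2, "c": 1, "d": 0, "e": 4}
pi = {"a": 0.5, "b": 0.5, "c": 0.25, "d": 0.25, "e": 1.0}

sample = ["b", "c", "e"]                           # one possible sample s
estimate = horvitz_thompson_total(sample, y, pi)   # 2/0.5 + 1/0.25 + 4/1.0
variance = poisson_ht_variance(y, pi)              # (2-1)*4 + (4-1)*1 + 0
print(estimate, variance)
```

A unit included with certainty ($\pi_k = 1$, like "e" here) contributes nothing to the variance, which is why large known contributors are often sampled with probability one.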
  30. Calibrated estimator (Deville-Särndal, 1992, [2]). Modification of the Horvitz-Thompson estimator to take auxiliary information into account. For example: $T(Y)$ = number of tweets @StarWars; $N$ = number of users in scope; structure of number of followers; number of verified users; ... Very similar to empirical likelihood methods ([8]).
  31. Subsection 2: Sampling design
  32. Sampling design: Bernoulli. Poisson sampling: for each $k \in U$, run a Bernoulli experiment with success probability $\pi_k$ to decide whether to include unit $k$ in the sample. Bernoulli sampling: $\forall k, \pi_k = p$. The sample size is random rather than fixed. We set the expected sample size to 20000.
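A minimal sketch of these two designs, using only the standard library. The population size of 100,000 and the seed are arbitrary assumptions chosen so that the expected sample size matches the 20000 mentioned on the slide; they are not the real Twitter figures.

```python
import random


def poisson_sample(pi, rng):
    """Include each unit k independently with its own probability pi[k]."""
    return [k for k, p in pi.items() if rng.random() < p]


def bernoulli_sample(units, p, rng):
    """Bernoulli sampling: Poisson sampling with the same probability p for every unit."""
    return poisson_sample({k: p for k in units}, rng)


rng = random.Random(2016)      # fixed seed, for reproducibility only
units = range(100_000)         # hypothetical population of user ids
p = 20_000 / len(units)        # expected sample size of 20000
s = bernoulli_sample(units, p, rng)

# The realised size is random: its expectation is N * p = 20000 and its
# standard deviation is sqrt(N * p * (1 - p)), about 126 here.
print(len(s))
```

The random sample size is the price paid for the simplicity of the design: each unit can be decided on its own, without knowing the whole population in advance.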
  33. Sampling design: stratified Bernoulli. More efficient estimators/design: use of external information.
  34. Sampling design: stratified Bernoulli. We write $U = U_1 \cup U_2$ with $U_1 \cap U_2 = \emptyset$ (the $U_h$, $h = 1, 2$, being called “strata”) and draw two independent Bernoulli samples in $U_1$ and $U_2$. Here: $U_1$ = followers of the official @starwars account; $U_2$ = rest of Twitter users.
  35. Sampling design: Neyman allocation. The minimum variance of the Horvitz-Thompson estimator is obtained for (Neyman, [7]): $n_h = n \frac{N_h S_h}{\sum_h N_h S_h}$, where $n$ is the total expected sample size, $N_h$ the size of stratum $h$ and $S_h^2$ the variance of the variable of interest within stratum $h$. Given the expected values, we set: $n_1 = 9700$, $n_2 = 10300$.
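The allocation can be sketched as follows, using the usual form of Neyman allocation, $n_h = n N_h S_h / \sum_h N_h S_h$, with $S_h$ the within-stratum standard deviation. The stratum sizes and spreads below are invented for the example; the talk does not give the values behind $n_1 = 9700$ and $n_2 = 10300$.

```python
def neyman_allocation(n, sizes, std_devs):
    """Split a total sample size n across strata proportionally to N_h * S_h."""
    weights = [N_h * S_h for N_h, S_h in zip(sizes, std_devs)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical inputs: a small stratum with a large spread and a large
# stratum with a small spread. These are NOT the values used in the talk.
sizes = [1_000, 9_000]     # N_1, N_2
std_devs = [3.0, 1.0]      # S_1, S_2
n_1, n_2 = neyman_allocation(200, sizes, std_devs)
print(n_1, n_2)            # the small but variable stratum is over-sampled
```

With these inputs the first stratum holds 10% of the population but receives 25% of the sample, which is the same mechanism that gives the @starwars followers nearly half of the sample in the talk.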
  36. Sampling design: stratified Bernoulli. Estimators for the two “simple” designs: $\hat{N}_{C1} = \frac{n_C}{p}$ (Bernoulli) and $\hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$ (stratified Bernoulli).
  37. Variance estimators: $\hat{V}(\hat{T}(Y))_1 = \frac{1}{p} \left( \frac{1}{p} - 1 \right) \sum_{k \in s} y_k^2$ and $\hat{V}(\hat{T}(Y))_2 = \sum_{h=1}^{2} \frac{1}{p_h} \left( \frac{1}{p_h} - 1 \right) \sum_{k \in s_h} y_k^2$.
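The point and variance estimators for the two simple designs can be put together as below. This is a sketch on invented numbers: the stratum sizes, sample sizes and counts are hypothetical, the variance formulas are applied to the indicator $z_k$ (so $y_k^2$ becomes $z_k$), and the rate $p_h$ is taken to be $n_h / N_h$, which is an assumption of the example.

```python
def bernoulli_count_estimate(n_c, p):
    """N_C1 = n_C / p under Bernoulli sampling with rate p."""
    return n_c / p


def stratified_count_estimate(n_c, stratum_sizes, sample_sizes):
    """N_C2 = sum over strata of (N_h / n_h) * n_Ch."""
    return sum(N_h / n_h * n_ch
               for n_ch, N_h, n_h in zip(n_c, stratum_sizes, sample_sizes))


def bernoulli_variance_estimate(sample_values, p):
    """(1/p) * (1/p - 1) * sum of squared sampled values."""
    return (1 / p) * (1 / p - 1) * sum(v ** 2 for v in sample_values)


def stratified_variance_estimate(samples, rates):
    """Sum of the Bernoulli variance estimates computed stratum by stratum."""
    return sum(bernoulli_variance_estimate(values, p_h)
               for values, p_h in zip(samples, rates))


# Hypothetical strata: followers of the account (small) and everyone else (large).
stratum_sizes = [1_000, 9_000]
sample_sizes = [100, 100]
z_samples = [[1] * 40 + [0] * 60, [1] * 2 + [0] * 98]   # sampled indicators z_k
n_c = [sum(z) for z in z_samples]                        # sampled users in scope
rates = [n_h / N_h for n_h, N_h in zip(sample_sizes, stratum_sizes)]

estimate = stratified_count_estimate(n_c, stratum_sizes, sample_sizes)
variance = stratified_variance_estimate(z_samples, rates)
cv = variance ** 0.5 / estimate
print(estimate, variance, cv)
```

Most of the variance in this toy case comes from the large stratum, where each sampled user in scope carries a weight of 90.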
  38. Section 3: Extending the sampling design
  39. Snowball sampling. From now on, our sampling designs will include extensions: $s = s_0 \cup s_{ext}$. $s_0$ is still selected using stratified Bernoulli sampling, but with an expected sample size of 1000, so that the expected sample size of $s$ is roughly 20000.
  40. Subsection 1: Snowball sampling
  41. Snowball sampling: population $U$.
  42. Snowball sampling: initial sample $s_0$.
  43. Snowball sampling: one-stage snowball extension $s = A(s_0)$.
  44. Snowball sampling. Formally, we write: $B_i = \{i\} \cup \{j \in V : E_{ji} \neq \emptyset\}$, $A_i = \{i\} \cup \{j \in V : E_{ij} \neq \emptyset\}$, and $s = A(s_0)$.
  45. Snowball sampling. $\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$, where $\bar{\pi}(B_i) = P(B_i \subset \bar{s}_0) = \prod_{k \in B_i} (1 - P(k \in s_0)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$, with $q_{S_h}$ the probability that a unit of stratum $h$ is left out of $s_0$.
  46. Snowball sampling. $\hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\bar{\pi}(B_i \cup B_j)} \gamma_{ij}$, where $\gamma_{ij} = \frac{\bar{\pi}(B_i \cup B_j) - \bar{\pi}(B_i) \bar{\pi}(B_j)}{[1 - \bar{\pi}(B_i)][1 - \bar{\pi}(B_j)]}$.
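The one-stage snowball extension and its point estimator can be sketched on a tiny directed graph. The sketch reads $E_{ij}$ as "there is an edge from $i$ to $j$", so that a unit $i$ ends up in $s$ exactly when at least one member of $B_i$ was drawn in the initial sample. The graph, strata and exclusion probabilities `q` are invented for the example.

```python
def in_neighbourhood(edges, i):
    """B_i: i together with every j that has an edge j -> i."""
    return {i} | {j for (j, k) in edges if k == i}


def out_neighbourhood(edges, i):
    """A_i: i together with every j that has an edge i -> j."""
    return {i} | {k for (j, k) in edges if j == i}


def snowball(edges, s0):
    """One-stage extension s = A(s0): union of the A_i for i in s0."""
    s = set()
    for i in s0:
        s |= out_neighbourhood(edges, i)
    return s


def prob_missed(units, stratum, q):
    """pi_bar(B): probability that no unit of B is in the initial sample,
    for independent Bernoulli draws with exclusion probability q[h] in stratum h."""
    prob = 1.0
    for k in units:
        prob *= q[stratum[k]]
    return prob


def snowball_count_estimate(edges, s, z, stratum, q):
    """N_C3 = sum over i in s of z_i / (1 - pi_bar(B_i))."""
    return sum(z[i] / (1 - prob_missed(in_neighbourhood(edges, i), stratum, q))
               for i in s)


# Hypothetical directed graph on 4 users and a one-unit initial sample.
edges = {(1, 2), (1, 3), (2, 3), (4, 1)}
stratum = {1: 1, 2: 1, 3: 2, 4: 2}
q = {1: 0.5, 2: 0.9}             # exclusion probabilities per stratum
z = {1: 1, 2: 0, 3: 1, 4: 1}     # 1 if the user tweeted at least once
s0 = {1}

s = snowball(edges, s0)
estimate = snowball_count_estimate(edges, s, z, stratum, q)
print(sorted(s), estimate)
```

Units with many in-neighbours have a small $\bar{\pi}(B_i)$, so they are very likely to be reached and receive a weight close to one.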
  47. Subsection 2: Adaptive sampling
  48. Adaptive sampling (Thompson, [10]). Used in official statistics to measure the number of drug users or HIV-positive people. The sampling design is often compared to the video game “Minesweeper”.
  49. Adaptive sampling. Image from [11].
  50. Adaptive sampling. Once a unit bearing the characteristic of interest (i.e. a user who tweeted about the Star Wars trailer) is found, its whole network (i.e. its friends, friends of friends, etc. who have tweeted about Star Wars) is included in the sample.
  51. Adaptive sampling. Estimator: $\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$, where: $K$ = number of networks; $y^*_k$ = total of $Y$ in network $k$; $n^*_{Ck}$ = number of people with $y_k \geq 1$ in network $k$; $J_k = \mathbb{1}\{k \in C\}$; $\pi_{gk}$ = probability that the initial sample intersects network $k$.
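A minimal sketch of the adaptive expansion and of the estimator follows. It makes several simplifying assumptions that are not stated in the talk: friendship is treated as undirected, the initial sample is a plain Bernoulli sample with rate $p$ (so that $\pi_{gk} = 1 - (1 - p)^{n_{gk}}$ for a network of size $n_{gk}$), and the sum runs over the networks actually reached by the initial sample. The graph and values are invented.

```python
def network_of(start, neighbours, z):
    """All units with z == 1 reachable from `start` through units with z == 1."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in neighbours.get(u, ()):
            if z[v] and v not in seen:
                seen.add(v)
                stack.append(v)
    return frozenset(seen)


def adaptive_count_estimate(s0, neighbours, z, p):
    """Sum over distinct reached networks of size / P(initial sample hits it)."""
    networks = {network_of(k, neighbours, z) for k in s0 if z[k]}
    return sum(len(net) / (1 - (1 - p) ** len(net)) for net in networks)


# Hypothetical undirected friendship graph and tweet indicators.
neighbours = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5], 7: []}
z = {1: 1, 2: 1, 3: 0, 4: 1, 5: 1, 6: 1, 7: 0}
s0 = [2, 3, 5]     # initial sample
p = 0.5            # initial sampling rate

estimate = adaptive_count_estimate(s0, neighbours, z, p)
print(estimate)
```

User 3 is sampled but did not tweet, so it triggers no expansion and separates users 2 and 4 into different networks; the network {4} is never reached and is accounted for only through the weights of the networks that were.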
  52. Adaptive sampling. When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the case of the stratified adaptive design: $\hat{N}_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}}$, where $n_0 = \# s_0$ and $s_0 = \cup_r \{k \in s, \delta(k, C) = 1\}$ is the union of the sides of $C$.
  53. Adaptive sampling - Variance. $\hat{V}(\hat{N}_{C4}) = \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{y_k y_{k'}}{\pi_{gkk'}} \left( \frac{\pi_{gkk'}}{\pi_{gk} \pi_{gk'}} - 1 \right)$, where $\pi_{gkk'} = 1 - (1 - \pi_{gk}) - (1 - \pi_{gk'}) + (1 - p)^{n_{gk} + n_{gk'}}$ and $n_{gk}$ is the size of network $k$.
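The double sum can be sketched as below, assuming a Bernoulli initial sample with rate $p$, so that a network of size $n$ is hit with probability $1 - (1 - p)^n$ and two distinct networks are both hit with probability $1 - (1 - p)^{n} - (1 - p)^{n'} + (1 - p)^{n + n'}$; the diagonal terms use $\pi_{gkk} = \pi_{gk}$. The network totals and sizes are invented.

```python
def hit_prob(size, p):
    """Probability that a Bernoulli(p) initial sample hits a network of `size` units."""
    return 1 - (1 - p) ** size


def joint_hit_prob(size_k, size_l, p):
    """Probability that two distinct networks are both hit (inclusion-exclusion)."""
    return 1 - (1 - p) ** size_k - (1 - p) ** size_l + (1 - p) ** (size_k + size_l)


def adaptive_variance_estimate(totals, sizes, p):
    """Double sum over the sampled networks; pi_kk is taken to be pi_k."""
    variance = 0.0
    for k, (y_k, n_k) in enumerate(zip(totals, sizes)):
        for l, (y_l, n_l) in enumerate(zip(totals, sizes)):
            pi_k, pi_l = hit_prob(n_k, p), hit_prob(n_l, p)
            pi_kl = pi_k if k == l else joint_hit_prob(n_k, n_l, p)
            variance += y_k * y_l / pi_kl * (pi_kl / (pi_k * pi_l) - 1)
    return variance


# Two hypothetical sampled networks, each with 2 users in scope.
totals = [2, 2]
sizes = [2, 2]
variance = adaptive_variance_estimate(totals, sizes, p=0.5)
print(variance)
```

Under this independence assumption the joint probability factorises into $\pi_{gk} \pi_{gk'}$, so the off-diagonal terms vanish (up to rounding) and only the diagonal contributes.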
  54. Adaptive sampling - Variance. Variance estimation for the Rao-Blackwellized estimator can be done by selecting $m$ samples: $\hat{V}(\hat{N}_{C5}) = \hat{V}(\hat{N}_{C4}) - \frac{1}{m - 1} \sum_{i=1}^{m} (\hat{N}_{C5i} - \hat{N}_{C4})^2$.
  55. Section 4: Results and future work
  56. Subsection 1: Results
  57. Results.

      | Design     | $n$    | $n_{scope}$ | $n_0$ | $\hat{N}_C$ | $\widehat{CV}$ | $\widehat{Deff}$ |
      |------------|--------|-------------|-------|-------------|----------------|------------------|
      | Bernoulli  | 20013  | 3946        | -     | 354121      | 0.231          | 1.04             |
      | Stratified | 20094  | 9832        | -     | 316889      | 0.097          | 0.68             |
      | 1-snowball | 159957 | 73570       | 1000  | 331097      | 0.031          | 0.60             |
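To read the coefficients of variation more concretely, the sketch below turns each reported estimate and CV into an approximate 95% interval, $\hat{N}_C (1 \pm 1.96 \, \widehat{CV})$. The normal approximation is an assumption of this example and is not claimed in the talk; the figures are the ones reported in the results above.

```python
# (estimate of N_C, estimated coefficient of variation) as reported in the results.
results = {
    "Bernoulli": (354121, 0.231),
    "Stratified": (316889, 0.097),
    "1-snowball": (331097, 0.031),
}


def normal_interval(estimate, cv, z=1.96):
    """Approximate interval estimate +/- z * cv * estimate (normal approximation)."""
    half_width = z * cv * estimate
    return estimate - half_width, estimate + half_width


intervals = {design: normal_interval(*values) for design, values in results.items()}
for design, (low, high) in intervals.items():
    print(f"{design:<11} [{low:,.0f} ; {high:,.0f}]")
```

The snowball interval is the narrowest, but it is obtained with a realised sample roughly eight times larger than the other two designs.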
  58. Results. Mean number of tweets @StarWars per user: 1.18 ± 0.07. This suggests that bots are not responsible for this very large number of tweets (see [4], [3])!
  59. Subsection 2: Sample size
  60. Snowball sampling - sample size. Expected sample size ≈ 20000. Actual sample size: > 150000!
  61. Adaptive sampling. With our test subject (tweets @AmericanIdol), the average network size was no greater than a few units (≈ 10000 tweets in the scope). With Star Wars (≈ 300000 tweets in the scope, with far fewer tweets per person), we couldn't get to the end of every network!
  62. Subsection 3: Future work
  63. Future work. Control the sample size. Estimates and calibration for other statistics (centrality, clustering coefficients, path length, etc.). Take advantage of graph description using models.
  64. Auxiliary information for the Barabási-Albert model (rows and columns: Degree, Centrality, Local clustering, Mean path, Max path; entries listed row by row). Degree: ++ - - - -; Centrality: - - - -; Local clustering: + +; Mean path: ++; Max path: no entry.
  65. Conclusion. Thank you! http://nc233.com/isnps2016 @nc233
  66. References. [1] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013. [2] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992. [3] Emilio Ferrara. “Manipulation and abuse on social media” by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsl., (Spring):4:1–4:9, April 2015. [4] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.
  67. References. [5] Eric D. Kolaczyk. Statistical analysis of network data. Springer, 2009. [6] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network?: the structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd international conference on World Wide Web, pages 493–498. International World Wide Web Conferences Steering Committee, 2014. [7] Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, pages 558–625, 1934.
  68. References. [8] Art B. Owen. Empirical likelihood. CRC Press, 2010. [9] Tiago P. Peixoto. The graph-tool python library. figshare, 2014. [10] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990. [11] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.
