Sampling graphs efficiently - MAD Stat (TSE)

Slides used for presentation "Sampling graphs efficiently" at Toulouse School of Economics, 3/23/2017

1. Sampling graphs efficiently: model-assisted designs and application to Twitter data. Antoine Rebecq, Université Paris X - INSEE, 3/23/17. Slide header: Statistics and networks; Survey sampling; Extending the sampling design; Application to Twitter data. Slide footer: Antoine Rebecq, Sampling designs for graphs.
2. Outline: (1) Statistics and networks: Graphs and stats; Methods - algorithms - models. (2) Survey sampling: Estimates; Use of auxiliary information. (3) Extending the sampling design: Snowball sampling; Adaptive sampling. (4) Application to Twitter data: The problem; Results; Model-assisted sampling.
3. Section 1: Statistics and networks
4. Subsection 1: Graphs and stats
5. Graphs. A graph $G$ is a set of vertices and edges: $G = (V, E)$.
6. Directed graphs
7. Statistics of interest - graphs: size, degree, centrality, clustering, communities, ...
8. Degree: $d_v$ = number of edges incident upon vertex $v$.
9. Degree / scale-free property
10. Path lengths
11. Centrality: a measure of the “importance” of a node. Examples: Google PageRank, betweenness centrality (the number of times a node acts as a bridge along the shortest path between two other nodes).
12. Betweenness centrality
13. Clustering. Global clustering coefficient $= \frac{3 \cdot \text{number of triangles}}{\text{number of connected triplets}}$. Local clustering coefficient of a vertex = how close its neighbours are to being a clique (complete graph).
14. Local clustering coefficient
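
The degree and clustering definitions above can be checked on a toy example. The following is a minimal sketch, assuming an undirected graph stored as a dictionary of neighbour sets; the function names and the toy graph `TOY` are illustrative and do not come from the slides.

```python
from itertools import combinations


def degrees(adj):
    """Degree d_v of every vertex: the number of edges incident upon v."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def local_clustering(adj, v):
    """Share of pairs of neighbours of v that are themselves linked."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention used here: no pair of neighbours to close
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def global_clustering(adj):
    """3 * (number of triangles) / (number of connected triplets)."""
    closed = 0    # triangles counted once per corner, i.e. 3 * triangles
    triplets = 0  # connected triplets centred on each vertex
    for v, nbrs in adj.items():
        k = len(nbrs)
        triplets += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triplets if triplets else 0.0


# Toy graph (assumption): a triangle a-b-c plus a pendant vertex d tied to c.
TOY = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

On `TOY`, vertex `c` has three neighbours of which only one pair is linked, so its local coefficient is 1/3, while the single triangle and five connected triplets give a global coefficient of 3/5.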
15. The rise of “big graphs”
16. The rise of “big graphs”. Example: the Graph500 benchmark (http://www.graph500.org). Data sets of up to 1.1 PB stored as adjacency lists (human connectome size).
17. Subsection 2: Methods - algorithms - models
18. Methods for graph statistics: algorithms (computer science, “big data”); model-based estimation; sampling (“design-based estimation”).
19. Methods for graph statistics
20. Computer science methods: efficient algorithms (speed / memory).
21. Computer science methods: efficient algorithms (speed / memory). Sometimes require sampling.
22. Model-based estimation. Famous graph models: Erdős-Rényi; Price / Barabási-Albert (heavy-tailed degree distribution); Watts-Strogatz / “small-world” (short path lengths); stochastic block models (communities). Images from [8].
23. Model-based estimation: Erdős-Rényi (“random graphs”)
24. Model-based estimation: Barabási-Albert (“preferential attachment”)
25. Model-based estimation: Watts-Strogatz (“small world”)
26. Model-based estimation: stochastic block models
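
To make two of the models named above concrete, here is a minimal sketch of an Erdős-Rényi generator (each possible edge is kept independently with probability `p`) and of a simple preferential-attachment generator (each new vertex links to `m` existing vertices drawn with probability proportional to their degree). This is not the graph-tool implementation behind the slide images; the function names, the complete-graph seed and the parameters are assumptions made for illustration.

```python
import random


def erdos_renyi(n, p, rng):
    """G(n, p): keep each of the n(n-1)/2 possible edges with probability p."""
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                edges.add((i, j))
    return edges


def preferential_attachment(n, m, rng):
    """Grow a graph in which new vertices prefer high-degree vertices."""
    # Assumed seed graph: a complete graph on the first m + 1 vertices.
    edges = {(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)}
    # Each vertex appears once per incident edge, so a uniform draw from
    # this list is a degree-proportional draw among vertices.
    targets = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:  # m distinct existing vertices
            chosen.add(rng.choice(targets))
        for old in chosen:
            edges.add((old, new))
            targets.extend((old, new))
    return edges
```

For example, `preferential_attachment(50, 2, random.Random(0))` returns 3 seed edges plus 2 edges for each of the 47 added vertices.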
27. Sampling / design-based estimation. Sampling: select a few vertices/edges and compute estimators using sample data. Very little exists about design-based statistical inference on networks (Kolaczyk 2009, [5]). We try survey sampling methods used in official statistics institutes to make design-based inference about “big graphs”.
28. Section 2: Survey sampling
29. Subsection 1: Estimates
30. Horvitz-Thompson estimator. Population $U$ (here, the vertices of the graph). Assign every $k \in U$ an inclusion probability $P(k \in s) = \pi_k$.
31. Horvitz-Thompson estimator. Classic unbiased estimator for totals and means (Horvitz-Thompson): $\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$ and $\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$.
32. Horvitz-Thompson estimator. The variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities $\pi_k = P(k \in s)$ and $\pi_{kl} = P(k, l \in s)$: $V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$.
33. Bernoulli sampling. Poisson sampling: for each $k \in U$, run a Bernoulli experiment with success probability $\pi_k$ to decide whether to include unit $k$ in the sample. Bernoulli sampling: $\forall k, \ \pi_k = p$.
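
The Poisson design and the Horvitz-Thompson total can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's code: units are indexed 0..N-1, and the function names are made up. Under Poisson sampling the draws are independent, so the joint inclusion probability of two distinct units is the product of their individual probabilities and the double sum in the variance reduces to its diagonal, the sum over k of (1 - pi_k) / pi_k * y_k^2.

```python
import random


def poisson_sample(pis, rng):
    """Include unit k independently with probability pis[k]."""
    return [k for k, pi in enumerate(pis) if rng.random() < pi]


def ht_total(ys, pis, sample):
    """Horvitz-Thompson estimator of the total: sum of y_k / pi_k over s."""
    return sum(ys[k] / pis[k] for k in sample)


def ht_variance_poisson(ys, pis):
    """Exact variance of the HT total under Poisson sampling.

    Only the diagonal terms of the double sum survive, because
    pi_kl = pi_k * pi_l for k != l and pi_kk = pi_k.
    """
    return sum((1 - pi) / pi * y * y for y, pi in zip(ys, pis))
```

Bernoulli sampling is the special case `pis = [p] * N`. With `pis` all equal to 1 the sample is the whole population, the estimate equals the true total and the variance is 0.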
34. Subsection 2: Use of auxiliary information
35. Auxiliary information. If $\pi_k \propto y_k$, then $V(\hat{T}(Y)_{HT}) = 0$. In practice, use an auxiliary variable $X$ that is well correlated with $Y$.
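
The zero-variance claim can be verified in one line. The derivation below assumes a design of fixed size $n$, strictly positive $y_k$, and $n \, y_k \leq T(Y)$ for every $k$ so that the probabilities are valid; these conditions are not stated on the slide.

```latex
\pi_k = n \, \frac{y_k}{T(Y)}
\quad \Longrightarrow \quad
\hat{T}(Y)_{HT}
  = \sum_{k \in s} \frac{y_k}{\pi_k}
  = \sum_{k \in s} \frac{T(Y)}{n}
  = T(Y).
```

Every sample returns the true total, so the estimator has no variance. Since $y_k$ is unknown before the survey, the inclusion probabilities are set proportional to a known proxy instead.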
36. Stratified sampling. We write $U = U_1 \cup U_2 \cup \dots \cup U_H$ and draw independent samples in each $U_h$. Strata should be formed so that the within-stratum dispersion of $y_k$ is as low as possible.
37. Stratified sampling: Neyman allocation. Given a set of strata and a total sample size $n$, the variance is minimized for $n_h = n \, \frac{N_h S_h}{\sum_{h'} N_{h'} S_{h'}}$, where $S_h^2$ is the variance of $y$ within stratum $U_h$.
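
As an illustration, the standard Neyman allocation gives each stratum a share of the total sample size proportional to its size times its within-stratum standard deviation. The sketch below assumes these standard deviations are known or estimated beforehand; the function name and the example figures are hypothetical.

```python
def neyman_allocation(n, sizes, stds):
    """Split a total sample size n across strata.

    sizes[h] is the stratum size N_h and stds[h] the within-stratum
    standard deviation S_h; stratum h gets n * N_h * S_h / sum(N * S).
    The result is not rounded; rounding is left to the caller.
    """
    weights = [n_h * s_h for n_h, s_h in zip(sizes, stds)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical example: a small, dispersed stratum and a large, homogeneous one.
ALLOCATION = neyman_allocation(100, [1000, 3000], [3.0, 1.0])
```

Here both strata have the same product of size and standard deviation, so each receives half of the sample even though the second stratum is three times larger.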
38. Calibrated estimator (Deville and Särndal, 1992, [2]). A modification of the Horvitz-Thompson estimator that takes auxiliary information into account. Very similar to empirical likelihood methods ([7]). Computing variances for calibrated estimators is easy.
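
A minimal sketch of calibration, assuming the simplest setting: a single auxiliary variable with a known population total and the linear (chi-square distance) method, for which the calibrated weights have a closed form. The function name is made up, and real applications calibrate on several variables at once with dedicated software.

```python
def linear_calibration_weights(d, x, total_x):
    """Adjust design weights d_k = 1 / pi_k so that the weighted sample
    total of x matches the known population total total_x.

    Linear method: w_k = d_k * (1 + lam * x_k), with lam chosen so that
    sum(w_k * x_k) == total_x. Weights may become negative when the
    required adjustment is large.
    """
    ht_x = sum(dk * xk for dk, xk in zip(d, x))
    denom = sum(dk * xk * xk for dk, xk in zip(d, x))
    lam = (total_x - ht_x) / denom
    return [dk * (1 + lam * xk) for dk, xk in zip(d, x)]
```

The calibrated estimator of a total of `y` is then the sum of `w_k * y_k` over the sample; it reproduces the known total of `x` exactly.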
39. Section 3: Extending the sampling design
40. Official statistics: measuring “hidden populations”
41. Community structure. When trying to measure the size of a community ($\hat{N}_C$), edges are used as auxiliary variables.
42. Snowball sampling. From now on, our sampling designs will include extensions: $s = s_0 \cup s_{ext}$.
43. Subsection 1: Snowball sampling
44. Snowball sampling: population $U$
45. Snowball sampling: initial sample $s_0$
46. Snowball sampling: one-stage snowball extension $s = A(s_0)$
47. Snowball sampling. Formally, we write $B_i = \{i\} \cup \{j \in V : E_{ji} \neq \emptyset\}$, $A_i = \{i\} \cup \{j \in V : E_{ij} \neq \emptyset\}$, and $s = A(s_0)$.
48. Snowball sampling. $\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$, where $\bar{\pi}(B_i) = P(B_i \subset \bar{s}) = \prod_{k \in B_i} (1 - P(k \in s))$.
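
The one-stage snowball design and its estimator can be sketched as follows. The assumptions are labeled in the code: the directed graph is a set of `(source, target)` pairs, the initial sample is drawn with independent inclusions (Poisson or Bernoulli), and the probabilities entering the product are those of the initial draw. All names are illustrative.

```python
def closed_predecessors(i, edges):
    """B_i: vertex i plus every j with an edge j -> i."""
    return {i} | {j for (j, k) in edges if k == i}


def closed_successors(i, edges):
    """A_i: vertex i plus every j with an edge i -> j."""
    return {i} | {k for (j, k) in edges if j == i}


def snowball(s0, edges):
    """One-stage extension s = A(s0): union of A_i over the initial sample."""
    s = set()
    for i in s0:
        s |= closed_successors(i, edges)
    return s


def exclusion_probability(b, pi0):
    """Probability that no unit of b is drawn, assuming independent draws
    with initial inclusion probabilities pi0[k]."""
    prob = 1.0
    for k in b:
        prob *= 1.0 - pi0[k]
    return prob


def snowball_estimate(s, z, edges, pi0):
    """Sum over the final sample of z_i / (1 - exclusion probability of B_i)."""
    return sum(
        z[i] / (1.0 - exclusion_probability(closed_predecessors(i, edges), pi0))
        for i in s
    )
```

A unit ends up in the final sample as soon as it or one of its predecessors is drawn initially, which is why its weight involves the whole set B_i rather than the unit alone. Computing the weights requires the predecessors of every sampled unit, which must be observable in the data source.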
49. Snowball sampling. $\hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\bar{\pi}(B_i \cup B_j)} \gamma_{ij}$, where $\gamma_{ij} = \frac{\bar{\pi}(B_i \cup B_j) - \bar{\pi}(B_i) \bar{\pi}(B_j)}{[1 - \bar{\pi}(B_i)][1 - \bar{\pi}(B_j)]}$.
50. Subsection 2: Adaptive sampling
51. Adaptive sampling (Thompson, [9]). Used in official statistics to measure the number of drug users or HIV-positive people. The sampling design is often compared to the video game “Minesweeper”.
52. Adaptive sampling. Image from [10].
53. Adaptive sampling. Once a unit bearing the characteristic of interest is found, its whole network is included in the sample.
54. Adaptive sampling. Estimator: $\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$, where $K$ = number of networks; $y^*_k$ = total of $Y$ in network $k$; $n^*_{Ck}$ = number of people with $y_k \geq 1$ in network $k$; $J_k = 1\{k \in C\}$; $\pi_{gk}$ = probability that the initial sample intersects network $k$.
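
A minimal sketch of this network-level estimator, assuming the initial sample is a Bernoulli sample with probability `p`, so that a network of a given size is intersected with probability 1 - (1 - p)^size. Each network is passed as a hypothetical `(size, n_c, j)` triple holding its size, its count of units of interest and its indicator; this data layout is an assumption, not the author's.

```python
def intersection_probability(size, p):
    """Probability that a Bernoulli(p) initial sample hits a network of
    the given size: one minus the chance that every unit is missed."""
    return 1.0 - (1.0 - p) ** size


def adaptive_estimate(networks, p):
    """Sum over networks of n_c * j / intersection probability.

    networks: iterable of (size, n_c, j) triples, one per distinct
    network reached by the adaptive design.
    """
    return sum(
        n_c * j / intersection_probability(size, p)
        for size, n_c, j in networks
    )
```

Each network is counted once, however many of its members fall in the initial sample, and large networks receive smaller weights because they are more likely to be hit.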
55. Adaptive sampling. When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the case of the stratified adaptive design: $\hat{N}_{C5} = n_0 + \sum_{k=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}}$, where $n_0 = \# s_0$ and $s_0 = \cup_r \{k \in s, \delta(k, C) = 1\}$ is the union of the sides of $C$.
56. Adaptive sampling - variance. $\hat{V}(\hat{N}_{C4}) = \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{y_k y_{k'}}{\pi_{gkk'}} \left( \frac{\pi_{gkk'}}{\pi_{gk} \pi_{gk'}} - 1 \right)$, where, for $k \neq k'$, $\pi_{gkk'} = 1 - (1 - \pi_{gk}) - (1 - \pi_{gk'}) + (1 - p)^{n_{gk} + n_{gk'}}$, and $\pi_{gkk} = \pi_{gk}$.
57. Adaptive sampling - variance. Variance estimation for the Rao-Blackwellized estimator can be done by selecting $m$ samples: $\hat{V}(\hat{N}_{C5}) = \hat{V}(\hat{N}_{C4}) - \frac{1}{m - 1} \sum_{i=1}^{m} (\hat{N}_{C5i} - \hat{N}_{C4})^2$.
58. Section 4: Application to Twitter data
59. Subsection 1: The problem
60. The Twitter graph: Twitter in 2013. Image from [1].
61. The Twitter API. Access to the Twitter data goes through an API (application programming interface), which limits the number of calls per hour.
62. Example: Star Wars: The Force Awakens. How many (real) users are behind the tweets about the new Star Wars movie?
63. Example: “Star Wars, The Force Awakens”. Let's write $y_k$ = number of tweets @starwars by user $k$ on 10/29/15 between 7:48 and 10:48 PM EST, and $z_k = 1\{y_k \geq 1\}$. Goal: estimate $N_C = T(Z)$. Additionally, we write $n_C = \sum_{k \in s} z_k$.
64. The Twitter graph ([6]): is directed; its degree distribution is heavy-tailed.
65. The Twitter graph: has small path lengths.
66. Sampling designs: (1) Bernoulli sample; (2) stratified Bernoulli; (3) snowball over the stratified Bernoulli; (4) adaptive over the stratified Bernoulli; (5) Rao-Blackwell of the adaptive estimator.
67. Stratification: $U_1$ = followers of the official @starwars account; $U_2$ = rest of Twitter users.
68. Stratification: Neyman allocation. Given some preliminary exploratory data, we get (for $n = 20000$): $n_1 = 9700$, $n_2 = 10300$.
69. Sample size - extension. Size of $s_0$: 1000 (so that the total sample size, with extensions, would be about $n = 20000$).
70. Calibration variables: $N$ = number of users in scope; structure of the number of followers; number of verified users; ...
71. Estimators: $\hat{N}_{C1} = \frac{n_C}{p}$; $\hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$; $\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)}$; $\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$; $\hat{N}_{C5} = n_0 + \sum_{k=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}}$.
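
The first two estimators are simple enough to write down directly. This sketch only transcribes the formulas for the Bernoulli and stratified designs; the function and argument names are illustrative, and the figures in the usage note are made up rather than taken from the Twitter study.

```python
def bernoulli_estimate(n_c, p):
    """Bernoulli design: number of sampled users in C divided by the
    common inclusion probability p."""
    return n_c / p


def stratified_estimate(n_total, n_stratum1, n1, n2, n_c1, n_c2):
    """Stratified design with two strata.

    n_total and n_stratum1 are the population sizes N and N_1, n1 and n2
    the sample sizes per stratum, and n_c1 and n_c2 the sampled counts of
    users in C per stratum. Each count is scaled by the inverse sampling
    rate of its stratum.
    """
    return n_stratum1 / n1 * n_c1 + (n_total - n_stratum1) / n2 * n_c2
```

With hypothetical values N = 1000, N_1 = 200, n_1 = 50, n_2 = 100, n_C1 = 10 and n_C2 = 4, the stratified estimate is 4 * 10 + 8 * 4 = 72.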
72. Exclusion probabilities: $\bar{\pi}(B_i) = P(B_i \subset \bar{s}) = \prod_{k \in B_i} (1 - P(k \in s)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$.
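
Under the stratified Bernoulli design, the product over the units of B_i only depends on how many of them fall in each stratum. The sketch below assumes that the two q values are the per-stratum probabilities of not being drawn (one minus the stratum sampling rate), which the slide does not spell out, and that every unit of B_i belongs to one of the two strata. The function name is hypothetical.

```python
def stratified_exclusion_probability(b, u1, q1, q2):
    """Probability that no unit of b is drawn.

    b: set of units (B_i); u1: set of units in stratum U_1; q1, q2:
    assumed per-unit exclusion probabilities in strata U_1 and U_2.
    """
    in_u1 = len(b & u1)
    in_u2 = len(b) - in_u1  # every other unit is taken to be in U_2
    return q1 ** in_u1 * q2 ** in_u2
```

For a set with two units in the first stratum and one in the second, with q1 = 0.9 and q2 = 0.5, the exclusion probability is 0.9^2 * 0.5 = 0.405.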
73. Subsection 2: Results
74. Results:

| Design | $n$ | $n_{scope}$ | $n_0$ | $\hat{N}_C$ | $\widehat{CV}$ | $\widehat{Deff}$ |
|---|---|---|---|---|---|---|
| Bernoulli | 20013 | 3946 | | 354121 | 0.231 | 1.04 |
| Stratified | 20094 | 9832 | | 316889 | 0.097 | 0.68 |
| 1-snowball | 159957 | 73570 | 1000 | 331097 | 0.031 | 0.60 |
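
To read the coefficient of variation in the results, one can turn it into an approximate confidence interval. The sketch assumes that the reported CV is the estimated standard error divided by the estimate and that a normal approximation holds; neither assumption is stated on the slide, and the helper name is made up.

```python
def confidence_interval(estimate, cv, z=1.96):
    """Approximate interval: estimate +/- z * cv * estimate.

    Assumes cv is the standard error divided by the estimate, and a
    normal approximation (z = 1.96 for a level of about 95%).
    """
    half_width = z * cv * estimate
    return estimate - half_width, estimate + half_width


# Snowball result from the slides: estimate 331097 with an estimated CV of 0.031.
SNOWBALL_CI = confidence_interval(331097, 0.031)
```

Under the same assumptions, the Bernoulli estimate (CV 0.231) has an interval more than seven times wider than the snowball estimate (CV 0.031), which is the sense in which the extended design is more efficient here.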
75. Results. Mean number of tweets @StarWars per user: 1.18 ± 0.07. This suggests that bots are not responsible for this very large number of tweets (see [4], [3]). Adaptive sampling did not converge.
76. Subsection 3: Model-assisted sampling
77. Auxiliary information for the Barabási-Albert model. Columns: Degree, Centrality, Local clustering, Mean path, Max path. Rows: Degree: ++ - - - -; Centrality: - - - -; Local clustering: + +; Mean path: ++; Max path: no entries.
78. Future work: combine all these (optimal allocations, etc.); asymptotics.
79. Conclusion. Thank you! http://nc233.com/madstat2017 @nc233
80. References (slides 80-82):
[1] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.
[2] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
[3] Emilio Ferrara. “Manipulation and abuse on social media” by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsl., (Spring):4:1–4:9, April 2015.
[4] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.
[5] Eric D. Kolaczyk. Statistical analysis of network data. Springer, 2009.
[6] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network?: the structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd international conference on World Wide Web, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.
[7] Art B. Owen. Empirical likelihood. CRC Press, 2010.
[8] Tiago P. Peixoto. The graph-tool python library. figshare, 2014.
[9] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.
[10] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.