Sampling the Twitter graph

Stats on social networks / Twitter
Survey sampling
Extending the sampling design
Results and future work
Using sampling methods to estimate rare stats on
Twitter’s graph
Antoine Rebecq
INSEE - Université Paris X
12/14/15

Outline
1 Stats on social networks / Twitter
Motivation
Towards design-based estimation
2 Survey sampling
Estimates
Sampling design
3 Extending the sampling design
Snowball sampling
Adaptive sampling
4 Results and future work
Results
Sample size
Future work

Section 1
Stats on social networks / Twitter

Subsection 1
Motivation

Big data begets big graph
Twitter in 2013
Image from [2]

Studies - Twitter
A large range of studies have used Twitter data (Computer Science, Sociology, Psychology, etc.)
Data on Twitter can be collected via:
The REST API (limited number of queries, but queries can be about anything)
The Streaming API (only 1% of the tweets matching some criteria)
The Firehose (unlimited access, but expensive)

The Twitter graph
The Twitter graph ([7]):
Is directed (a user can follow another without being followed back)
Has a heavy-tailed degree distribution

The Twitter graph
Has small path lengths

Subsection 2
Towards design-based estimation

Model-based estimation:
Scale-free networks, Barabási-Albert ([1])
Small-world networks, Watts-Strogatz ([13])

Very little exists about design-based statistical inference on networks (Kolaczyk 2009, [6])
We try survey sampling methods used in official statistics institutes to make design-based inference about "big graphs"

Example: Star Wars: The Force Awakens

Example: "Star Wars: The Force Awakens"
Let's write:
$y_k$ = number of tweets @starwars by user $k$ between 7:48 and 10:48 PM EST on 10/29/15
$z_k = \mathbb{1}\{y_k \geq 1\}$
Goal: estimate $N_C = T(Z)$, the number of users who tweeted at least once
Additionally, we write: $n_C = \sum_{k \in s} z_k$

Section 2
Survey sampling

Subsection 1
Estimates

Horvitz-Thompson estimator
Population $U$: vertices of the Twitter graph.
Assign each $k \in U$ an inclusion probability $\pi_k = P(k \in s)$, where $s$ denotes the sample.

Classic unbiased estimator for totals and means: Horvitz-Thompson
\[ \hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k} \]
\[ \hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k} \]
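As a minimal sketch of the Horvitz-Thompson estimator, the Python snippet below draws a Poisson sample from a small synthetic population and computes the estimated total and mean. The population values, the inclusion probabilities and the function name `horvitz_thompson_total` are illustrative assumptions, not data or code from the study.

```python
import random

def horvitz_thompson_total(sample_values, sample_probs):
    """Horvitz-Thompson estimate of a total: sum of y_k / pi_k over the sample."""
    return sum(y / pi for y, pi in zip(sample_values, sample_probs))

# Hypothetical population: y_k = number of tweets of user k (made-up values).
rng = random.Random(0)
population = [rng.choice([0, 0, 0, 1, 2]) for _ in range(10_000)]
probs = [0.1] * len(population)  # equal inclusion probabilities (assumption)

# Poisson sampling: include unit k with probability pi_k, independently.
sample = [(y, pi) for y, pi in zip(population, probs) if rng.random() < pi]

t_hat = horvitz_thompson_total([y for y, _ in sample], [pi for _, pi in sample])
mean_hat = t_hat / len(population)  # estimated mean, with N assumed known

print(f"true total = {sum(population)}, HT estimate = {t_hat:.0f}")
```

Each sampled unit is weighted by the inverse of its inclusion probability, so a unit drawn with probability 0.1 stands for ten units of the population.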

The variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities:
$\pi_k = P(k \in s)$
$\pi_{kl} = P(k, l \in s)$ (with $\pi_{kk} = \pi_k$)
\[ V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l} \]
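To make the double sum concrete, here is a small sketch that evaluates it for Poisson sampling, where units are drawn independently so that the joint inclusion probability of two distinct units is the product of their individual probabilities. The tiny population, the probabilities and the function names are made up for illustration.

```python
def ht_variance(y, pi, pi_joint):
    """Double sum over the population of (pi_kl - pi_k pi_l) * y_k/pi_k * y_l/pi_l."""
    n = len(y)
    return sum(
        (pi_joint(k, l) - pi[k] * pi[l]) * y[k] / pi[k] * y[l] / pi[l]
        for k in range(n)
        for l in range(n)
    )

# Hypothetical small population and inclusion probabilities (made-up values).
y = [0, 1, 3, 0, 2]
pi = [0.2, 0.5, 0.5, 0.1, 0.4]

def poisson_joint(k, l):
    # Under Poisson sampling units are drawn independently,
    # so pi_kl = pi_k * pi_l for k != l, and pi_kk = pi_k.
    return pi[k] if k == l else pi[k] * pi[l]

v_double_sum = ht_variance(y, pi, poisson_joint)

# Closed form for Poisson sampling: sum of (1 - pi_k) / pi_k * y_k^2.
v_closed_form = sum((1 - p) / p * yk ** 2 for yk, p in zip(y, pi))
```

Only the diagonal terms survive under independence, which is why Bernoulli and Poisson designs have such simple variance formulas.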

Calibrated estimator
Deville-Särndal, 1992 ([3]). Modification of the Horvitz-Thompson estimator to take auxiliary information into account. For example:
$T(Y)$ = number of tweets @StarWars
$N$ = number of users in scope
Structure of the number of followers
Number of verified users
...
Very similar to empirical likelihood methods ([9]).
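The idea of calibration can be sketched with the simplest case: one auxiliary variable and the linear (chi-square distance) method, where each design weight is multiplied by a factor chosen so that the weighted sample total of the auxiliary variable matches its known population total. The sample values, the known total and the function name `linear_calibration` below are assumptions made for illustration; the slides do not specify which calibration method or variables were actually used.

```python
def linear_calibration(design_weights, x, x_total):
    """Linear calibration on one auxiliary variable:
    w_k = d_k * (1 + lam * x_k), with lam chosen so that sum(w_k * x_k) = x_total."""
    ht_x = sum(d * xk for d, xk in zip(design_weights, x))
    lam = (x_total - ht_x) / sum(d * xk ** 2 for d, xk in zip(design_weights, x))
    return [d * (1 + lam * xk) for d, xk in zip(design_weights, x)]

# Hypothetical sample of 5 users (made-up values).
d = [10.0] * 5                 # design weights 1 / pi_k
verified = [1, 0, 0, 1, 0]     # auxiliary variable: verified-user indicator
tweets = [3, 0, 1, 2, 0]       # variable of interest y_k
known_verified_total = 25      # assumed known population total of the auxiliary variable

w = linear_calibration(d, verified, known_verified_total)
calibrated_total = sum(wk * yk for wk, yk in zip(w, tweets))
```

With several auxiliary variables the scalar `lam` becomes a vector obtained by solving a small linear system, but the principle is the same.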

Subsection 2
Sampling design

Sampling frame
Each Twitter user is assigned a unique id. When a new user is created, it is assigned an id greater than all previously assigned ids.
However, not all ids match an existing user ($\approx 3.1 \cdot 10^9$ ids as of October 2015), which means our frame over-covers the population. Over-coverage can be corrected by using either a Horvitz-Thompson or a Hájek estimator (see [10]).
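The following sketch shows one way to handle over-coverage on a toy frame: ids are sampled, ids that match no user are discarded, and the remaining units keep their original weights. The frame size, the share of unused ids, the sampling rate and the tweet counts are all made-up assumptions.

```python
import random

rng = random.Random(1)
FRAME_SIZE = 50_000   # hypothetical number of ids in the frame
p = 0.05              # Bernoulli sampling rate on ids (assumption)

# Hypothetical frame: each id maps to a tweet count, or None if no user has that id.
frame = [rng.choice([None, 0, 0, 1]) for _ in range(FRAME_SIZE)]

# Bernoulli sampling on the ids, then drop the ids that match no user.
sampled = [y for y in frame if rng.random() < p]
in_scope = [y for y in sampled if y is not None]

# Horvitz-Thompson: out-of-scope ids simply contribute 0 to the total.
ht_total = sum(y / p for y in in_scope)
ht_population_size = len(in_scope) / p   # estimated number of existing users

# Hajek: ratio of two HT estimators, i.e. the estimated mean per existing user.
hajek_mean = ht_total / ht_population_size
```

The Horvitz-Thompson total needs no knowledge of how many ids are valid, while the Hájek ratio estimates that number from the same sample.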

Sampling design: Bernoulli
Poisson sampling: for each $k \in U$, run a Bernoulli experiment with probability $\pi_k$ to decide whether to include unit $k$ in the sample.
Bernoulli sampling: $\forall k, \pi_k = p$
The sample size of this design is not fixed. We set the expected sample size to 20000.
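With a frame of about $3.1 \cdot 10^9$ ids, testing every id one by one is wasteful. One possible implementation, sketched here as an assumption rather than as the procedure actually used, jumps from one selected id to the next with geometrically distributed gaps, which yields exactly a Bernoulli sample. The function name `bernoulli_sample_ids` is hypothetical.

```python
import math
import random

def bernoulli_sample_ids(max_id, p, rng):
    """Yield ids in 1..max_id, each kept independently with probability p,
    by jumping geometric gaps instead of testing every id."""
    k = 0
    while True:
        u = 1.0 - rng.random()                           # uniform on (0, 1]
        k += 1 + int(math.log(u) / math.log1p(-p))       # geometric gap >= 1
        if k > max_id:
            return
        yield k

MAX_ID = 3_100_000_000          # frame size quoted in the slides
p = 20_000 / MAX_ID             # rate giving an expected sample size of 20000
rng = random.Random(2)

sample_ids = list(bernoulli_sample_ids(MAX_ID, p, rng))
print(len(sample_ids))          # random, close to 20000 but not fixed
```

The realized sample size fluctuates around its expectation, which is the "non-fixed sample size" property of the design.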

Sampling design: Stratified Bernoulli
We write $U = U_1 \cup U_2$ (the subsets indexed by $h = 1, 2$ are called "strata") and draw two independent Bernoulli samples, one in $U_1$ and one in $U_2$.
Here:
$U_1$ = followers of the official @starwars account
$U_2$ = rest of Twitter users

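Sampling design: Neyman allocation
The variance of the Horvitz-Thompson estimator is minimized for (Neyman, [8]):
\[ n_h = n \, \frac{N_h S_h}{\sum_{h'} N_{h'} S_{h'}} \]
where $n$ is the total sample size, $N_h$ the size of stratum $h$ and $S_h$ the standard deviation of the variable of interest in stratum $h$.
Given the expected values, we set:
$n_1 = 9700$
$n_2 = 10300$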
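A short sketch of Neyman allocation in its standard form, where the sample size of each stratum is proportional to the stratum size times the within-stratum standard deviation. The stratum sizes and standard deviations below are invented for illustration and are not the values behind the allocation quoted above.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample size n proportionally to N_h * S_h."""
    weights = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata (made-up values): a small stratum where the variable
# is dispersed and a very large stratum where it is almost constant.
N = [5_000_000, 300_000_000]
S = [0.30, 0.005]

n_h = neyman_allocation(20_000, N, S)
print([round(x) for x in n_h])
```

A small but heterogeneous stratum can thus receive as many sample units as a huge homogeneous one, which is the pattern seen in the allocation above.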

Sampling design: Stratified Bernoulli
Estimators for the two "simple" designs:
\[ \hat{N}_{C1} = \frac{n_C}{p} \]
\[ \hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2} \]
where $n_{Ch}$ is the number of sampled users of stratum $h$ with $z_k = 1$.
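The two estimators can be compared on a synthetic population. In this sketch the stratum sizes, the proportions of users who tweeted and the sample sizes are all made-up assumptions; only the two formulas come from the slides.

```python
import random

rng = random.Random(3)

# Hypothetical population (made-up): z_k = 1 if user k tweeted at least once.
# Stratum 1 (e.g. followers of an account) is small but has many z_k = 1.
N1, N2 = 20_000, 180_000
z1 = [1 if rng.random() < 0.20 else 0 for _ in range(N1)]
z2 = [1 if rng.random() < 0.01 else 0 for _ in range(N2)]
N = N1 + N2

def bernoulli_count(z, p):
    """Draw a Bernoulli(p) sample and return the number of sampled units with z_k = 1."""
    return sum(zk for zk in z if rng.random() < p)

# Design 1: plain Bernoulli sampling with expected size 20000.
p = 20_000 / N
n_C = bernoulli_count(z1 + z2, p)
N_C1 = n_C / p

# Design 2: stratified Bernoulli with expected sizes n1 and n2 (arbitrary split).
n1, n2 = 10_000, 10_000
n_C1 = bernoulli_count(z1, n1 / N1)
n_C2 = bernoulli_count(z2, n2 / N2)
N_C2 = N1 / n1 * n_C1 + (N - N1) / n2 * n_C2

true_N_C = sum(z1) + sum(z2)
print(true_N_C, round(N_C1), round(N_C2))
```

Oversampling the stratum where the characteristic is frequent is what makes the stratified design more precise for a rare characteristic.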

Variance estimators
\[ \hat{V}(\hat{T}(Y))_1 = \sum_{k \in s} \frac{(1 - p) \, y_k^2}{p^2} \]
For an indicator variable such as $z_k$, $z_k^2 = z_k$.
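As a worked example of the Bernoulli variance estimator, the sketch below computes the estimated total, its estimated variance and the resulting coefficient of variation for a made-up sample. It uses the usual Horvitz-Thompson form with squared values, which coincides with the unsquared form for 0/1 variables; the sample and the function name are assumptions.

```python
import math

def bernoulli_variance_estimate(sample_y, p):
    """Estimated variance of the HT total under Bernoulli(p) sampling:
    sum over the sample of (1 - p) * y_k^2 / p^2."""
    return sum((1 - p) * y ** 2 / p ** 2 for y in sample_y)

# Hypothetical sample (made-up): 40 of the 1000 sampled users have z_k = 1.
p = 0.01
sample_z = [1] * 40 + [0] * 960

total_hat = sum(sample_z) / p                        # estimated number of users in C
var_hat = bernoulli_variance_estimate(sample_z, p)   # estimated variance
cv_hat = math.sqrt(var_hat) / total_hat              # estimated coefficient of variation
```

The coefficient of variation depends mostly on the number of sampled units bearing the characteristic, which is why rare characteristics are hard to estimate with a plain Bernoulli design.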

Section 3
Extending the sampling design

Snowball sampling
From now on, our sampling designs will include extensions:
$s = s_0 \cup s_{ext}$
$s_0$ is still selected using stratified Bernoulli sampling, but with an expected sample size of 1000, so that the expected sample size of $s$ is about 20000.

Subsection 1
Snowball sampling

Snowball sampling
Population $U$
Initial sample $s_0$
One-stage snowball extension $s = A(s_0)$

Snowball sampling
Formally, we write:
$B_i = \{i\} \cup \{j \in V, E_{ji} \neq \emptyset\}$
$A_i = \{i\} \cup \{j \in V, E_{ij} \neq \emptyset\}$
$s = A(s_0) = \bigcup_{i \in s_0} A_i$
where $E_{ij}$ denotes the set of edges from $i$ to $j$.
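The sets above can be illustrated on a toy directed graph. In this sketch the graph and the function names are made up, and an edge from one unit to another is stored in a dictionary of out-neighbors; whether an edge stands for "follows" or "is followed by" is left open, as in the slides.

```python
# Hypothetical directed graph (made-up): out_edges[i] holds every j with an edge i -> j.
out_edges = {
    1: {2, 3},
    2: {3},
    3: set(),
    4: {1},
    5: {4},
}

def A(i):
    """A_i: unit i plus every j reachable from i in one step (E_ij non-empty)."""
    return {i} | out_edges[i]

def B(i):
    """B_i: unit i plus every j that has an edge to i (E_ji non-empty)."""
    return {i} | {j for j, targets in out_edges.items() if i in targets}

def snowball(s0):
    """One-stage snowball extension: union of A_i over the initial sample."""
    extended = set()
    for i in s0:
        extended |= A(i)
    return extended

s = snowball({4})
```

A unit ends up in the extended sample exactly when at least one member of its set B belongs to the initial sample, which is what drives the inclusion probabilities of the design.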

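Snowball sampling
\[ \hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \bar{\pi}(B_i)} \]
where:
\[ \bar{\pi}(B_i) = P(B_i \subset \bar{s}_0) = \prod_{k \in B_i} (1 - P(k \in s_0)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)} \]
and $q_{S_h}$ is the probability that a unit of stratum $h$ is not selected in $s_0$.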
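The snowball estimator can be checked on a toy example small enough to enumerate every possible initial sample. The graph, the strata, the indicator values and the sampling rates below are invented; the sketch only illustrates that weighting each unit by the inverse of one minus the probability that its set B is missed gives an unbiased estimate.

```python
from itertools import product

# Hypothetical directed graph (made-up): out_edges[i] = {j : edge i -> j}.
out_edges = {1: {2, 3}, 2: {3}, 3: set(), 4: {1}, 5: {4}}
stratum = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}   # stratum h of each unit
z = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1}         # z_i = 1 if unit i tweeted
p = {1: 0.5, 2: 0.2}                       # Bernoulli rate of s0 in each stratum

def B(i):
    """Unit i plus every j with an edge j -> i."""
    return {i} | {j for j, targets in out_edges.items() if i in targets}

def pi_bar(units):
    """Probability that no unit of the set is in the initial sample s0."""
    prob = 1.0
    for k in units:
        prob *= 1.0 - p[stratum[k]]
    return prob

def n_c3(s0):
    """Snowball estimator: sum of z_i / (1 - pi_bar(B_i)) over s = A(s0)."""
    s = set(s0)
    for i in s0:
        s |= out_edges[i]
    return sum(z[i] / (1.0 - pi_bar(B(i))) for i in s)

# Exact expectation over all 2^5 possible initial samples (feasible on a toy graph).
units = sorted(out_edges)
expectation = 0.0
for flags in product([0, 1], repeat=len(units)):
    s0 = {k for k, f in zip(units, flags) if f}
    prob = 1.0
    for k, f in zip(units, flags):
        prob *= p[stratum[k]] if f else 1.0 - p[stratum[k]]
    expectation += prob * n_c3(s0)
```

On this toy graph the expectation equals the true count of units with a value of 1, which is 3.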

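Snowball sampling
\[ \hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\pi_{ij}} \gamma_{ij} \]
where:
\[ \gamma_{ij} = \frac{\bar{\pi}(B_i \cup B_j) - \bar{\pi}(B_i)\bar{\pi}(B_j)}{[1 - \bar{\pi}(B_i)][1 - \bar{\pi}(B_j)]} \]
and $\pi_{ij} = 1 - \bar{\pi}(B_i) - \bar{\pi}(B_j) + \bar{\pi}(B_i \cup B_j)$ is the probability that both $i$ and $j$ are in $s$.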

Subsection 2
Adaptive sampling

Adaptive sampling
Adaptive cluster sampling: see Thompson ([11])
Used in official statistics to estimate the number of drug users or HIV-positive people
Sampling design often compared to the video game "Minesweeper"

Adaptive sampling
Image from [12]

Adaptive sampling
Once a unit bearing the characteristic of interest (i.e. a user who tweeted about the Star Wars trailer) is found, its whole network (i.e. its friends, friends of friends, etc. who have also tweeted about Star Wars) is included in the sample.
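The expansion step can be sketched as a breadth-first search that only continues through units bearing the characteristic. The symmetric "friend" relation, the trait values and the function name `network_of` are assumptions made for this illustration.

```python
from collections import deque

def network_of(seed, neighbors, has_trait):
    """Return the network of `seed`: all units with the trait that can be reached
    from `seed` through a chain of neighbors who all have the trait."""
    if not has_trait[seed]:
        return {seed}          # a unit without the trait is a network of size 1
    network, queue = {seed}, deque([seed])
    while queue:
        current = queue.popleft()
        for friend in neighbors[current]:
            if has_trait[friend] and friend not in network:
                network.add(friend)
                queue.append(friend)
    return network

# Hypothetical symmetric "friend" relation and trait (made-up values).
neighbors = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}, 6: set()}
has_trait = {1: True, 2: True, 3: False, 4: True, 5: True, 6: True}

initial_sample = [1, 3]
sample = set()
for unit in initial_sample:
    sample |= network_of(unit, neighbors, has_trait)
```

The search stops at units without the characteristic, so the size of the final sample depends on how large the networks are, which cannot be known in advance.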

Adaptive sampling
Estimator:
\[ \hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}} \]
where:
$K$ = number of networks
$y^*_k$ = total of $Y$ in network $k$
$n^*_{Ck}$ = number of people with $y_k \geq 1$ in network $k$
$J_k = \mathbb{1}\{k \in C\}$
$\pi_{gk}$ = probability that the initial sample intersects network $k$
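A hedged sketch of this kind of estimator on a toy population follows. It assumes a plain Bernoulli initial sample, so that the probability of hitting a network depends only on its size, and it assumes that a network contributes to the sum only when the initial sample intersects it, as in Thompson's modified Horvitz-Thompson estimator. The population, the networks and the rate are made up.

```python
from itertools import product

p = 0.3                                    # Bernoulli rate of the initial sample (assumption)
population = [1, 2, 3, 4, 5, 6]
# Hypothetical networks of users who tweeted (made-up); user 3 did not tweet.
networks = [{1, 2}, {4, 5}, {6}]

def pi_g(network):
    """Probability that a Bernoulli(p) initial sample hits the network at least once."""
    return 1.0 - (1.0 - p) ** len(network)

def n_c4(s0):
    """Sum of n*_k / pi_gk over the networks intersected by the initial sample s0."""
    return sum(len(net) / pi_g(net) for net in networks if net & s0)

estimate = n_c4({2, 3})    # only the first network is hit

# Exact expectation over all initial samples, to illustrate unbiasedness on this toy case.
expectation = 0.0
for flags in product([0, 1], repeat=len(population)):
    s0 = {k for k, f in zip(population, flags) if f}
    prob = p ** len(s0) * (1.0 - p) ** (len(population) - len(s0))
    expectation += prob * n_c4(s0)
```

Here every member of a network has tweeted, so the network size plays the role of the count in the numerator, and the expectation equals the 5 users who tweeted.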

Adaptive sampling
When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the case of the stratified adaptive design:
\[ \hat{N}_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}} \]
where $n_0 = \#s_0$ and $s_0 = \cup_r \{k \in s, \delta(k, C) = 1\}$ is the union of the sides of $C$.

Section 4
Results and future work

Subsection 1
Results

Results

| Design | $n$ | $n_{scope}$ | $n_0$ | $\hat{N}_C$ | $\widehat{CV}$ | $\widehat{Deff}$ |
|---|---|---|---|---|---|---|
| Bernoulli | 20013 | 3946 | | 354121 | 0.231 | 1.04 |
| Stratified | 20094 | 9832 | | 316889 | 0.097 | 0.68 |
| 1-snowball | 159957 | 73570 | 1000 | 331097 | 0.031 | 0.60 |
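To read the estimated coefficients of variation in the results table, one can turn them into approximate 95% intervals. The sketch below assumes the estimators are approximately normal, which the slides do not state; the estimates and coefficients of variation are copied from the table.

```python
# Estimates and estimated coefficients of variation taken from the results table.
results = {
    "Bernoulli": (354121, 0.231),
    "Stratified": (316889, 0.097),
    "1-snowball": (331097, 0.031),
}

def normal_interval(estimate, cv, z=1.96):
    """Approximate 95% interval: estimate * (1 +/- z * CV), assuming normality."""
    half_width = z * cv * estimate
    return estimate - half_width, estimate + half_width

intervals = {design: normal_interval(n_hat, cv) for design, (n_hat, cv) in results.items()}
for design, (low, high) in intervals.items():
    print(f"{design:<11} [{low:,.0f} ; {high:,.0f}]")
```

Under this assumption the three intervals overlap, with the snowball design giving by far the narrowest one.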

Results
Mean number of tweets @StarWars per user: 1.18 ± 0.07
This suggests that bots are not responsible for this very large number of tweets (see [5], [4]).

Subsection 2
Sample size

Snowball sampling - sample size
Expected sample size ≈ 20000.
Actual sample size: > 150000!

Adaptive sampling
With our test subject (tweets @AmericanIdol), the average network size was no greater than a few units (≈ 10000 tweets in the scope).
With Star Wars (≈ 300000 tweets in the scope, with far fewer tweets per person), we couldn't get to the end of every network!

Subsection 3
Future work

Future work
Control sample size
Estimates and calibration on graph totals (centrality, clustering coefficients, path length, etc.)

Conclusion
Thank you !
http://nc233.com/cmstatistics2015
@nc233

[1] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[2] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.
[3] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
[4] Emilio Ferrara. "Manipulation and abuse on social media" by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsl., (Spring):4:1–4:9, April 2015.
[5] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.
[6] Eric D. Kolaczyk. Statistical Analysis of Network Data. Springer, 2009.
[7] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network? The structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd International Conference on World Wide Web companion, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.
[8] Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, pages 558–625, 1934.
[9] Art B. Owen. Empirical Likelihood. CRC Press, 2010.
[10] Olivier Sautory. Les enjeux méthodologiques liés à l'usage de bases de sondage imparfaites.
[11] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.
[12] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.
[13] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
