Power Law Distributions for Twitter Data

Power-Law Distributions in Twitter Data
Conor Feeney
University of Limerick
Supervisor: Prof James Gleeson
April 15, 2016

Outline
Twitter.
Power-law distributions.
Data collection.
Initial results.
Synthetic data & Kolmogorov-Smirnov test.
Comparing “poweRlaw” and Aaron Clauset’s code.
Twitter’s structure: changes in last three years.
Conclusion

Twitter
Twitter was founded in March 2006, and in a little over 10 years has amassed over 300 million users worldwide.
As a result, a large amount of social media data can be obtained from Twitter.
The purpose of this paper is to examine the potential presence of power-law distributions in Twitter data.

Introduction to Power-Law Distributions
A power-law probability distribution is a distribution
whose density function (or mass function in the
discrete case) has the form
$p(x) = C x^{-\alpha}$,
where C is a normalising constant.
Power-law distributions are described as “heavy-tailed”.
This means that extreme values are more likely than under, for example, the Gaussian distribution.
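As a worked example of the normalising constant, consider the continuous case and assume the power law holds only above some lower cutoff, written here as $x_{\min}$, with $\alpha > 1$ (both are assumptions of this sketch). Requiring the density to integrate to one fixes $C$:

```latex
\[
\int_{x_{\min}}^{\infty} C x^{-\alpha}\,dx = 1
\quad\Longrightarrow\quad
C = (\alpha - 1)\,x_{\min}^{\alpha - 1},
\qquad
p(x) = \frac{\alpha - 1}{x_{\min}} \left(\frac{x}{x_{\min}}\right)^{-\alpha}.
\]
```

The discrete case is normalised by a sum rather than an integral, so its constant differs.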

Plotting a Power-Law Distribution
The most common way is to plot the CCDF (complementary cumulative distribution function) on a log-log scale.
In theory, the plot should be approximately a straight line.
Very few empirical phenomena obey power laws for all values of x. Generally, the power law is only obeyed for values greater than some xmin.
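The straight-line behaviour follows from the form of the distribution: for a continuous power law above xmin, the CCDF is (x / xmin)^(-(α - 1)), so its logarithm is linear in log x with slope -(α - 1). The sketch below shows one way to compute an empirical CCDF and the points that would be drawn on log-log axes. It is written in Python for illustration rather than the R used in this work, and the function names `empirical_ccdf` and `loglog_points` are hypothetical.

```python
import math


def empirical_ccdf(values):
    """Return the distinct sorted values x and the fraction of samples >= x."""
    data = sorted(values)
    n = len(data)
    xs, ccdf = [], []
    for i, x in enumerate(data):
        # Only record the first occurrence of each value; at that index,
        # exactly n - i samples are greater than or equal to x.
        if i == 0 or x != data[i - 1]:
            xs.append(x)
            ccdf.append((n - i) / n)
    return xs, ccdf


def loglog_points(xs, ccdf):
    """Convert CCDF points to log10 coordinates (requires x > 0)."""
    return [(math.log10(x), math.log10(p)) for x, p in zip(xs, ccdf)]


if __name__ == "__main__":
    followers = [1, 2, 2, 4, 8, 8, 16, 120]  # made-up illustrative counts
    xs, ccdf = empirical_ccdf(followers)
    for lx, lp in loglog_points(xs, ccdf):
        print(f"{lx:.3f}  {lp:.3f}")
```

Plotting the second column against the first gives the log-log CCDF. Only the region above xmin is expected to look linear.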

Data Collection
First, a Twitter app was set up to access the Twitter API (Application Programming Interface). This is because the access codes generated by the app are needed to establish a connection with R.
Data was collected using the “twitteR” package in R. This package is designed to work specifically with Twitter and its API.
Some difficulties arose at the start, because only specific versions of R work with this package.

Results
Initially, a data set (1) of $1.89 \times 10^{5}$ rows was obtained.
“poweRlaw” and Clauset’s code were run on it to calculate α and xmin values.
Over the break between semesters, further data was collected and stored. This took nearly two weeks to fully collect.
Our new data set (2) contained $8.3 \times 10^{5}$ rows, and similar tests were carried out.
For the second data set, we were required to use the 64-bit version of R for our results.
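For a fixed xmin, the exponent α can be estimated by maximum likelihood. The sketch below uses the closed-form estimator for continuous data, α = 1 + n / Σ ln(x_i / xmin), with standard error (α - 1) / √n. It is a minimal Python illustration, not the code used in this work: “poweRlaw” and Clauset’s code also search for xmin and handle discrete data, which this sketch does not attempt. The function name `fit_alpha` is hypothetical.

```python
import math


def fit_alpha(values, xmin):
    """Continuous maximum-likelihood estimate of alpha for the tail x >= xmin.

    Returns (alpha_hat, standard_error, n_tail).
    """
    tail = [x for x in values if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need at least one observation strictly above xmin")
    alpha_hat = 1.0 + n / log_sum
    std_err = (alpha_hat - 1.0) / math.sqrt(n)
    return alpha_hat, std_err, n


if __name__ == "__main__":
    sample = [400, 520, 610, 900, 1500, 3200, 12000]  # made-up illustrative counts
    alpha_hat, std_err, n = fit_alpha(sample, xmin=360)
    print(f"alpha = {alpha_hat:.2f} +/- {std_err:.2f} (n = {n})")
```

Because the estimate depends on which observations lie above xmin, two tools that choose different xmin values will generally report different α values for the same data.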

Results
Table: Results for two Twitter data sets (C = Clauset’s code, R = “poweRlaw”).
Statistic             xmin (C)   xmin (R)   α (C)   α (R)
Followers (1)         360        329        2.2     2.19
Friends (1)           23251      9924       2.79    2.38
Rate of Posting (1)   0.99       0.99       2.04    2.04
Followers (2)         404        364        2.2     2.18
Friends (2)           51033      9986       3.08    2.37
Rate of Posting (2)   0.35       xx         2.01    xx

Figure: CCDF for the number of followers k of a random sample of $8.3 \times 10^{5}$ Twitter users.

Figure: Plot of the CCDF for the number of friends for users. Notice that R and Clauset’s code differ in their output values, with the R fit deviating from the data further along the tail.

Figure: Plot of the CCDF for the rate of posting for users. R failed
to produce an output so the black line represents the output from
Aaron Clauset’s code.

Synthetic Data
Synthetic data is any production data applicable to
a given situation that is not obtained by direct
measurement.
Synthetic data is generated to meet specific needs or certain conditions that may not be found in the original, real data.
The method used in this paper is known as inverse transform sampling.
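Inverse transform sampling draws a uniform random number u and passes it through the inverse of the target CDF. For a continuous power law above xmin, the CDF is F(x) = 1 - (x / xmin)^(-(α - 1)), which inverts to x = xmin (1 - u)^(-1 / (α - 1)). The Python sketch below illustrates this under the continuous assumption. The function name `sample_power_law` and the parameter values are hypothetical, and this is not the R code used in this work.

```python
import random


def sample_power_law(alpha, xmin, n, seed=None):
    """Draw n samples from a continuous power law with exponent alpha > 1."""
    if alpha <= 1.0 or xmin <= 0.0:
        raise ValueError("requires alpha > 1 and xmin > 0")
    rng = random.Random(seed)
    exponent = -1.0 / (alpha - 1.0)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the power
    # is always defined; u = 0 maps to x = xmin.
    return [xmin * (1.0 - rng.random()) ** exponent for _ in range(n)]


if __name__ == "__main__":
    synthetic = sample_power_law(alpha=2.2, xmin=360.0, n=10, seed=1)
    print([round(x) for x in synthetic])
```

Every sample is at least xmin, and larger values of α produce a thinner tail.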

Synthetic Data
We needed to create synthetic data sets that follow power-law distributions for our various samples.
They had to have α values equal to those of our empirical data.
This was needed to perform the KS test.

Kolmogorov-Smirnov Test Results
The test is necessary to show that the power-law model is a good fit for the data.
It was run on two random samples, of sizes $2 \times 10^{4}$ and $4 \times 10^{4}$.
The results from these samples gave p-values indicating that the data could plausibly be drawn from the power-law model.
This, however, was not true for the larger sample’s follower count, which gave a p-value of around 0.08.
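One common way to obtain such a p-value, described by Clauset and co-authors, is to compare the KS distance of the empirical data with the KS distances of synthetic data sets drawn from the fitted model. The p-value is the fraction of synthetic sets that fit worse than the real data, and a value of 0.1 or below is usually taken as evidence against the power law. The Python sketch below is a simplified illustration: it assumes continuous data, treats xmin as fixed, and only looks at the tail, so it is not equivalent to “poweRlaw” or Clauset’s code. All function names are hypothetical.

```python
import math
import random


def fit_alpha(tail, xmin):
    """Continuous maximum-likelihood exponent for data already >= xmin."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    data = sorted(tail)
    n = len(data)
    d = 0.0
    for i, x in enumerate(data):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(model - i / n), abs(model - (i + 1) / n))
    return d


def sample_power_law(alpha, xmin, n, rng):
    """Inverse transform sampling from the fitted continuous power law."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def bootstrap_p_value(tail, xmin, n_sims=200, seed=0):
    """Fraction of synthetic data sets whose KS distance is at least the observed one."""
    rng = random.Random(seed)
    alpha = fit_alpha(tail, xmin)
    d_obs = ks_distance(tail, xmin, alpha)
    exceed = 0
    for _ in range(n_sims):
        synth = sample_power_law(alpha, xmin, len(tail), rng)
        # Refit each synthetic set before measuring its distance.
        if ks_distance(synth, xmin, fit_alpha(synth, xmin)) >= d_obs:
            exceed += 1
    return exceed / n_sims


if __name__ == "__main__":
    rng = random.Random(42)
    tail = sample_power_law(2.2, 1.0, 500, rng)  # synthetic stand-in for real data
    print("p-value:", bootstrap_p_value(tail, xmin=1.0))
```

The number of synthetic sets controls the resolution of the p-value, so borderline results need more simulations than the 200 used here.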

Sample Results
Table: Results for two Twitter data sets (C = Clauset’s code, R = “poweRlaw”).
Statistic             α (C)   α (R)   p-value (C)   p-value (R)
Followers (A)         2.18    2.17    0.405         0.2213
Friends (A)           3       2.92    0.506         0.3455
Rate of Posting (A)   2.03    2.05    0.96          0.746
Followers (B)         2.2     2.19    0.103         0.086
Friends (B)           3.07    2.95    0.855         0.438
Rate of Posting (B)   2.03    2.01    0.427         0.45

A Comparison of R and Clauset’s Code
R’s code almost systematically calculated an xmin lower than Clauset’s, leading to less accurate α values.
“poweRlaw” generally gave lower p-values than Clauset’s code.
While it is okay to use, borderline values should be double-checked using additional resources.

Comparing 2013 Twitter to 2016 Twitter
A data set from a different paper was obtained.
Its data was collected in 2013. We used it for comparison purposes.
The only data it contained was each user’s number of followers, but it covered $8.2 \times 10^{5}$ users.
It had a similar α value: 2.13 (2013) vs 2.2 (2016).

Figure: CCDFs of the 2013 data set plotted alongside the 2016 data set (log-log axes: x against CCDF).

Conclusion
Twitter & Power-law Distributions.
Data Collection.
Initial Results.
Synthetic Data & Kolmogorov-Smirnov Test.
Comparing “poweRlaw” and Aaron Clauset’s Code.
Twitter’s Structure: Changes in Last Three Years.

Thank You for Listening
Questions?
