We start by very briefly introducing the Twitter platform and detailing the demographics of the users and the biases they introduce. The relationship between geography, mobility and social network properties will be described using the Twitter service as a case study. Finally, tutorial attendees will get the chance to review the most seminal works in the area where spatial and geographic perspectives are highlighted.
12. www.bgoncalves.com@bgoncalves
API Basics https://dev.twitter.com/docs
• The twitter module provides the oauth interface. We just need to provide the right
credentials.
• Best to keep the credentials in a dict and parametrize our calls with the dict key. This way
we can switch between different accounts easily.
• .Twitter(auth) takes an OAuth instance as argument and returns a Twitter object that we
can use to interact with the API
• Twitter methods mimic API structure
• 4 basic types of objects:
• Tweets
• Users
• Entities
• Places
https://github.com/bmtgoncalves/Mining-Georeferenced-Data
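The credential handling described above can be sketched as follows. This is a minimal sketch, assuming the third-party `twitter` package (Python Twitter Tools); the account key `"social"`, the dict field names and the helper names `oauth_args` and `make_client` are illustrative assumptions, not part of the original slides.

```python
def oauth_args(accounts, key):
    """Return one account's credentials in the positional order that
    twitter.OAuth takes: token, token_secret, consumer_key, consumer_secret."""
    acct = accounts[key]
    return (acct["token"], acct["token_secret"],
            acct["api_key"], acct["api_secret"])


def make_client(accounts, key):
    """Build a Twitter API object for the account stored under `key`."""
    import twitter  # imported lazily so the helpers above work without it
    auth = twitter.OAuth(*oauth_args(accounts, key))
    return twitter.Twitter(auth=auth)


# Placeholder credentials; the field names are hypothetical.
accounts = {
    "social": {
        "api_key": "API_KEY",
        "api_secret": "API_SECRET",
        "token": "TOKEN",
        "token_secret": "TOKEN_SECRET",
    },
}

# Switching accounts only requires changing the dict key:
# twitter_api = make_client(accounts, "social")
```

Keeping every account in one dict means the rest of the code never touches raw credentials, only the key.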
13. www.bgoncalves.com@bgoncalves
User Timeline https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline
• .statuses.user_timeline() returns a list of tweets posted by a single user
• Important options:
• include_rts=‘true’ to include retweets by this user
• count=200 number of tweets to return in each call
• trim_user=‘true’ to not include the user information (saves bandwidth and processing
time)
• max_id=1234 to include only tweets with an id lower than or equal to 1234
• Returns at most 200 tweets in each call. Can get all of a user's tweets (up to 3200) with
multiple calls using max_id
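The max_id paging loop can be sketched as below. This is a sketch under stated assumptions: `fetch_all_tweets` and `fetch_page` are hypothetical names, and `fetch_page` stands for any callable that wraps `.statuses.user_timeline()` and returns a list of tweet dicts with an `"id"` field.

```python
def fetch_all_tweets(fetch_page, limit=3200, page_size=200):
    """Collect a timeline by walking backwards with max_id.

    fetch_page(count=..., max_id=...) must return a list of tweets
    (dicts with an "id" key), newest first, like user_timeline does.
    """
    tweets = []
    max_id = None
    while len(tweets) < limit:
        kwargs = {"count": page_size}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = fetch_page(**kwargs)
        if not page:
            break  # no older tweets left
        tweets.extend(page)
        # Ask only for tweets strictly older than the oldest one seen so far.
        max_id = min(t["id"] for t in page) - 1
    return tweets[:limit]


# Possible usage with a client object named twitter_api (assumed to exist):
# tweets = fetch_all_tweets(
#     lambda **kw: twitter_api.statuses.user_timeline(
#         screen_name="bgoncalves", include_rts="true", trim_user="true", **kw))
```

Subtracting one from the smallest id avoids receiving the same tweet twice on consecutive pages.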
15. www.bgoncalves.com@bgoncalves
Streaming Geocoded data
• The Streaming API provides real-time data, subject to filters
• Use TwitterStream instead of the Twitter object (.TwitterStream(auth=twitter_api.auth))
• .statuses.filter(track=q) will return tweets that match the query q in real time
• Returns a generator that you can iterate over
• .statuses.filter(locations=bb) will return tweets that occur within the bounding box bb in
real time
• bb is a comma-separated pair of lon/lat coordinates (southwest corner first, then northeast)
• -180,-90,180,90 - World
• -74,40,-73,41 - NYC
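The bounding-box strings in the examples above follow the order west, south, east, north. A small sketch of helpers for building and checking such strings is given below; the names `bbox_string` and `in_bbox` are illustrative and not part of any library.

```python
def bbox_string(west, south, east, north):
    """Format a bounding box as 'west,south,east,north' (lon,lat pairs)."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must lie in [-180, 180]")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must lie in [-90, 90] with south <= north")
    return ",".join(str(v) for v in (west, south, east, north))


def in_bbox(lon, lat, bbox):
    """Check whether a (lon, lat) point falls inside a bounding-box string."""
    west, south, east, north = (float(v) for v in bbox.split(","))
    return west <= lon <= east and south <= lat <= north


NYC = bbox_string(-74, 40, -73, 41)
WORLD = bbox_string(-180, -90, 180, 90)

# Possible usage with a stream object (assumed to exist):
# for tweet in twitter_stream.statuses.filter(locations=NYC):
#     ...
```

A local check such as `in_bbox` can be useful for filtering stored tweets after collection.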
25. www.bgoncalves.com@bgoncalves
Demographics ICWSM’11, 375 (2011)
users who we could infer a gender for, based on their name
and the list previously described. We do so by comparing
the first word of their self-reported name to the gender list.
We observe that there exists a match for 64.2% of the users.
Moreover, we find a strong bias towards male users: Fully
71.8% of the users who we find a name match for had a
male name.
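The name-matching step described in this excerpt can be sketched as follows. The gender table here is a tiny hypothetical stand-in for the list used in the paper, and the function names are assumptions made for illustration.

```python
def infer_gender(name, gender_by_first_name):
    """Look up the first word of a self-reported name in a gender table.

    Returns "M", "F", or None when the first word is not in the table.
    """
    words = name.strip().split()
    if not words:
        return None
    return gender_by_first_name.get(words[0].lower())


def summarize(names, gender_by_first_name):
    """Return (fraction of names matched, fraction of matches that are male)."""
    inferred = [infer_gender(n, gender_by_first_name) for n in names]
    matched = [g for g in inferred if g is not None]
    match_rate = len(matched) / len(names) if names else 0.0
    male_share = matched.count("M") / len(matched) if matched else 0.0
    return match_rate, male_share


# Hypothetical toy table; a real study would use a much larger name list.
toy_table = {"john": "M", "mary": "F"}
toy_names = ["John Smith", "mary", "xX_cool_Xx", "JOHN DOE"]
match_rate, male_share = summarize(toy_names, toy_table)
```

The two returned numbers correspond to the match rate and the male share reported in the excerpt, computed here on toy data only.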
[Plot: fraction of joining users who are male (0 to 1) vs. date (2007-01 to 2009-07).]
Figure 3: Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time.
each last name with over 100 individuals in the U.S.
ing the 2000 Census, the Census releases the distributio
race/ethnicity for that last name. For example, the last n
“Myers” was observed to correspond to Caucasians 86%
the time, African-Americans 9.7%, Asians 0.4%, and
panics 1.4%.
Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users
whom we could infer the race/ethnicity by comparing
last word of their self-reported name to the U.S. Ce
last name list. We observed that we found a match
71.8% of the users. We then determined the distributio
race/ethnicity in each county by taking the race/ethn
distribution in the Census list, weighted by the freque
of each name occurring in Twitter users in that coun
Due to the large amount of ambiguity in the last name
race/ethnicity list (in particular, the last name list is m
than 95% predictive for only 18.5% of the users), we are
able to directly compare the Twitter race/ethnicity distr
1
This is effectively the census.model approach discuss
prior work (Chang et al. 2010).
(a) Normal representation
Figure 2: Per-county over- and underrepresentation of U.S. po
tation rate of 0.324%, presented in both (a) a normal layout an
Blue colors indicate underrepresentation, while red colors repre
to the log of the over- or underrepresentation rate. Clear trend
overrepresentation of populous counties.
less than 95% predictive (e.g., the name Avery was observed
to correspond to male babies only 56.8% of the time; it was
Undersampling
Oversampling
(a) Caucasian (non-hispanic) (b) African-American (c) Asian or Pacific Islander (d) Hispanic
Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and
Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity
are shown. Blue regions correspond to undersampling; red regions to oversampling.
32. www.bgoncalves.com@bgoncalves
The Strength of Weak Ties (1973)
• Interviews to find out how individuals found out about job opportunities.
• Mostly from acquaintances or friends of friends
• “It is argued that the degree of overlap of two individuals' social networks varies directly with the strength of their tie to one another”
… two possible cases in networks with …: (a) positively correlated nets and (b) … width of the line of the links represents … PHYSICAL REVIEW E 76, 066106 (2007)
[Figure: three-node network (A, B, C); node annotations: kin = 1, kout = 2, sin = 1, sout = 2; kin = 2, kout = 1, sin = 3, sout = 1; kin = 1, kout = 1, sin = 1, sout = 2.]
Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree.
Observing a retweet at node B provides implicit confirmation that information from A appeared in B’s Twitter feed, while a mention of B originating at node A explicitly confirms that A’s message appeared in B’s Twitter feed. This may or may not be noticed by B, therefore mention edges are less reliable indicators of information flow compared to retweet edges.
Retweet and reply/mention information parsed from the text can be ambiguous, as in the case when a tweet is marked as being a ‘retweet’ of multiple people. Rather, we rely on Twitter metadata, which designates users replied to or retweeted by each message. Thus, while the text of a tweet may contain several mentions, we only draw an edge to the user explicitly designated as the mentioned user by the metadata. In so doing, we may miss retweets that do not use the explicit retweet feature and thus are not captured in the metadata. Note that this is separate from our use of mentions as memes (§ 3.1), which we parse from the text of the tweet.
… indicates a particular egocentric network evolution. In order to quantify it, we measure the probability, p(n), that the next communication event of an agent having n social ties will occur via the establishment of a new (n + 1)th link. We calculate these probabilities in the MPC dataset averaging them for users with the same degree k at the end of the observation time. We therefore …
… Panels (a) and (b) show calls within 3 hours between people in the same town in two different time windows. … network structure, which was recorded by aggregating interactions during 6 months. Node size and colors … width and color represent weight.
33. www.bgoncalves.com@bgoncalves
Neighborhood Overlap PNAS 104, 7333 (2007)
[Figure: (A) degree distribution P(k) vs. degree k; (B) tie strength distribution P(w) vs. link weight w (s); (C) overlap O_ij between nodes v_i and v_j, shown for O_ij = 0, 1/3, 2/3 and 1; (D) ⟨O⟩_w and ⟨O⟩_b vs. P_cum(w) and P_cum(b). Slide labels: weight, betweenness, reshuffled.]
Fig. 1. Characterizing the large-scale structure and the tie strengths of the mobile call graph. (A and B) Vertex degree (A) and tie strength distribution (B). Each distribution was fitted with $P(x) = a (x + x_0)^{-\gamma_x} \exp(-x / x_c)$, shown as a blue curve, where x corresponds to either k or w. The parameter values for the fits …
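The overlap values on this slide (0, 1/3, 2/3, 1) can be reproduced with a short sketch. The formula used here, O_ij = n_ij / ((k_i - 1) + (k_j - 1) - n_ij) with n_ij the number of common neighbors, is the definition we take from the cited PNAS paper; it is not spelled out on the slide, and the function name and toy graphs are illustrative assumptions.

```python
def neighborhood_overlap(adj, i, j):
    """Overlap of the neighborhoods of two connected nodes i and j.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    n_ij = len((adj[i] & adj[j]) - {i, j})          # common neighbors
    denom = (len(adj[i]) - 1) + (len(adj[j]) - 1) - n_ij
    return n_ij / denom if denom > 0 else 0.0


# Toy graphs (hypothetical): a triangle and a small graph with overlap 1/3.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
partial = {
    "i": {"j", "x", "y"},
    "j": {"i", "x", "z"},
    "x": {"i", "j"},
    "y": {"i"},
    "z": {"j"},
}
```

In the triangle every neighbor of one endpoint is also a neighbor of the other, so the overlap is 1; in the second graph only one of the three other neighbors is shared.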
34. www.bgoncalves.com@bgoncalves
Strong Ties have higher overlaps PNAS 104, 7333 (2007)
[Figure: panels A to D; panel D plots ⟨O⟩_w and ⟨O⟩_b against P_cum(w) and P_cum(b), with link weight w (s) on the panel B axis. Slide labels: Real, Randomized, Betweenness.]
Fig. 1. … the large-scale structure and the tie strengths of the … (A and B) Vertex degree (A) and tie strength distribution (B). … fitted with $P(x) = a (x + x_0)^{-\gamma_x} \exp(-x / x_c)$, shown as a blue … corresponds to either k or w. The parameter values for the fits … k_c = ∞ (A, degree), and w_0 = 280, γ_w = 1.9, w_c = 3.45 × … Illustration of the overlap between two nodes, v_i and v_j, its … for four local network configurations. (D) In the real … ⟨O⟩_w (blue circles) increases as a function of cumulative … representing the fraction of links with tie strength … hypothesis is tested by randomly permuting the … the coupling between ⟨O⟩_w and w (red squares). The … as a function of cumulative link betweenness centrality …
35. www.bgoncalves.com@bgoncalves
Network Structure PLoS One 7, e29358 (2012)
The Strength of Intermediary Ties in Social Media
“People whose networks bridge the structural holes
between groups have an advantage in detecting and
developing rewarding opportunities. Information
arbitrage is their advantage. They are able to see
early, see more broadly, and translate information
across groups.”
Structural Holes and Good Ideas
Ronald S. Burt, University of Chicago
AJS Volume 110 Number 2 (September 2004): 349–99
This article outlines the mechanism by which brokerage prov
social capital. Opinion and behavior are more homogeneous w
than between groups, so people connected across groups are m
familiar with alternative ways of thinking and behaving. Broke
across the structural holes between groups provides a vision o
tions otherwise unseen, which is the mechanism by which broke
becomes social capital. I review evidence consistent with the
pothesis, then look at the networks around managers in a
American electronics company. The organization is rife with s
tural holes, and brokerage has its expected correlates. Compensa
positive performance evaluations, promotions, and good idea
disproportionately in the hands of people whose networks
structural holes. The between-group brokers are more likely t
press ideas, less likely to have ideas dismissed, and more like
have ideas evaluated as valuable. I close with implications for
ativity and structural change.
The hypothesis in this article is that people who stand near the hol
a social structure are at higher risk of having good ideas. The argum
is that opinion and behavior are more homogeneous within than betw
groups, so people connected across groups are more familiar with a
36. www.bgoncalves.com@bgoncalves
Network Structure PLoS One 7, e29358 (2012)
The Strength of Intermediary Ties in Social Media
… to Granovetter expectation that the stronger the tie is the higher number of mutual contacts of both parties it has and the higher the … belong to the same group.
Figure 2. Group and link statistics. (A) Size distribution of the group. (B) Distribution of the number of groups to which each user is assigned. (C) Percentage of links of different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), staying in particular topological localizations in respect to detected groups.
doi:10.1371/journal.pone.0029358.g002
37. www.bgoncalves.com@bgoncalves
Groups PLoS One 7, e29358 (2012)
The Strength of Intermediary Ties in Social Media
Figure 4. Group-group activity. (A) Distribution of the number of links in the follower network between groups as a function of the … groups. (B) Fractions f of links of the different types (follower, with mentions and with retweets) as a function of the size of the group …
2.4 Links between groups
The next question to consider is the characteristics of links between groups. These links occur mainly between groups containing less than 200 users (Figure 4A–C). However, their frequency depends on the quality of the links (if they bear mentions or retweets). While links with mentions are less abundant than the baseline, those with retweets are slightly more abundant. According to the strength of weak ties theory [12,14–16], weak links are typically connections between persons no… neighbors, being important to keep the network conn… for information diffusion. We investigate whether … between groups play a similar role in the online n… information transmitters. The actions more related to in… diffusion are retweets [24] that show a slight prefe… occurring on between-group links (Figures 4B and … preference is enhanced when the similarity between … groups is taken into account. We define the similarity be… groups, A and B, in terms of the Jaccard index … connections:
$$\mathrm{similarity}(A,B) = \frac{|\text{links of } A \cap B|}{|\text{links of } A \cup B|}$$
The similarity is the overlap between the groups’ connec… it estimates network proximity of the groups. The gener… is that links with mentions more likely occur between clo… and retweets occur between groups with medium … (Figure 4D). Mentions as personal messages are … exchanged between users with similar environments … predicted by the strength of weak ties theory. Links with … are related to information transfer and the similarity of the groups between which they take place should be small according to Granovetter’s theory. The results show that the most likely to attract retweets are the links connecting groups that are neither too close nor too far. This can be explained with Aral’s theory about the trade-off between diversity and bandwidth: if the two groups are too close there is not enough diversity in the information, while if the groups are too far the communication is poor. These trends are not dependent on the size of the considered groups (see Figs. S6, S7, S8, S9, S10, S11, S12, S13, S14 and Table S1 in the Supplementary Information).
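The group similarity used in this excerpt is a Jaccard index over the groups' connections. A minimal sketch follows, assuming each group's connections are available as a collection of link identifiers; the function name and the toy link sets are hypothetical.

```python
def group_similarity(links_a, links_b):
    """Jaccard index of two groups' connection sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy example: two groups sharing two of four distinct connections.
links_group_a = {"u1", "u2", "u3"}
links_group_b = {"u2", "u3", "u4"}
similarity = group_similarity(links_group_a, links_group_b)
```

A value near 1 means the two groups connect to almost the same users (close groups), and a value near 0 means they share almost no connections (distant groups).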
40. www.bgoncalves.com@bgoncalves
Twitter Follower Distance Social Networks 34, 73 (2012)
Y. Takhteyev et al. / Social Networks 34 (2012) 73–81
… of physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, New … towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on …
41. www.bgoncalves.com@bgoncalves
Locality Social Networks 34, 73 (2012)
Table 5
Top countries.
Country | Share of egos (%) [a] | Share of egos (%) for egos in dyads [b] | Share of alters (%) [c] | Percentage of domestic ties [d] | Percentage of domestic ties among non-local ties [d] | Following foreign alters/being followed from abroad | Country named explicitly (% of egos)
USA | 48.5 | 45.7 | 54.5 | 91.6 | 89.3 | 0.3 | 8.1
Brazil | 10.6 | 12.1 | 10.5 | 83.5 | 72.5 | 4.9 | 55.4
UK | 7.6 | 8.3 | 7.6 | 50.6 | 33.3 | 1.2 | 45.3
Japan | 5.5 | 6.5 | 6.3 | 92.1 | 86.0 | 1.4 | 25.0
Canada | 3.7 | 3.8 | 2.9 | 33.3 | 23.1 | 1.6 | 58.5
Australia | 2.7 | 2.7 | 1.9 | 50.0 | 32.0 | 2.2 | 69.7
Indonesia | 2.6 | 1.8 | 1.2 | 60.0 | 25.0 | 7.0 | 83.3
Germany | 2.1 | 1.8 | 1.3 | 62.9 | 58.8 | 3.2 | 58.6
Netherlands | 1.4 | 1.4 | 1.2 | 66.7 | 22.2 | 1.5 | 54.3
Mexico | 1.2 | 1.3 | 0.7 | 44.0 | 8.3 | 7.0 | 56.7
[a] Out of the 2852 egos located at the level of country or better.
[b] Out of the egos included in 1953 dyads with both parties located at the level of country or better.
[c] Out of the 1953 alters located at the level of country or better.
[d] The number of ties with the ego and the alter in the given country as a share of all ties for egos in that country.
… between those two interpretations. We also note that top Twitter clusters intersect only to an extent with Alderson and Beckfield’s (2004) ranking of world cities based on multinational corporations’ branch headquarters. (Of Alderson and Beckfield’s top 25 cities by in-degree or “prestige,” 13 appear in the top 25 Twitter clusters ranked by in-degree centrality, with another 6 appearing in top 100.)
5.3. National borders
Of the ties that were matched to countries, 75 percent connect users in the same country. This prevalence of domestic ties is
Table 6
The most common languages. Based on 2852 egos.
Language | % of egos
English | 72.5
Portuguese | 10.1
Japanese | 5.4
Spanish | 3.1
Indonesian | 1.8
German | 1.7
Dutch | 1.0
Chinese | 0.9
Korean | 0.4
Swedish | 0.4
… accounts, by randomly drawing an account from among those “followed” by each of those egos. We then coded the locations of the alters using the same procedure as we did for the egos, removing those pairs where the alter could not be assigned to a country. In the end, we obtained a sample of 1953 ego-alter pairs with both the ego and the alter assigned to a country, including 1259 pairs with “specific” locations for both parties (Table 1).
4.4. Aggregating nearby locations
Since specific locations vary substantially in precision and since users can often choose between a range of specific names for the same place (e.g., “Palo Alto” vs. “Silicon Valley” vs. “SF Bay”), we aggregated nearby locations within each country, by assigning a set of coordinates (obtained from Google Maps) to each location smaller than 25,000 km² and then merging nearby locations within each country by replacing their coordinates with a weighted average of the coordinates of the merged locations. This reduced our location descriptions to a set of 386 regional clusters, which are comparable in size to metropolitan areas. We labeled each cluster with the most common name associated with it in our sample. For example, the cluster centered on Manhattan is referred to as “New York.”
5. Analysis
In this section we analyze the factors affecting the formation of Twitter ties. We first look at the effect of each variable identified earlier based on theoretical considerations: the actual physical distance, the frequency of air travel, national boundaries, and language differences. In addition to presenting the descriptive statistics demonstrating the effects of each variable and investigating the nature of such effects, we correlated the effects using the Quadratic Assignment Procedure (QAP, Krackhardt, 1987; Butts, 2007). In the last subsection we also examined the relationship between the variables using QAP regression (Double Dekker Semi-partialling MRQAP). All statistical calculations were done using UCINet 6.277 (Borgatti et al., 2002).
For correlation and regression analysis we used networks with nodes representing the 25 largest regional clusters of users (see
Table 3
Top clusters.
Rank | Cluster [a] | Share of egos (%) [b] | Share of egos (%) for egos in dyads [c] | Share of alters (%) [d] | Locality [e]
1 | “New York” | 8.5 | 8.3 | 10.2 | 54.3
2 | “Los Angeles, CA” | 5.1 | 5.6 | 10.4 | 53.3
3 | “ ” (Tokyo) | 4.1 | 4.8 | 5.0 | 62.9
4 | “London” | 3.6 | 3.3 | 4.9 | 48.8
5 | “São Paulo” | 3.5 | 3.0 | 3.6 | 78.4
6 | “San Francisco” | 2.8 | 2.7 | 4.1 | 41.2
7 | “New Jersey” [f] | 2.5 | 2.8 | 2.1 | 20.0
8 | “Chicago” | 2.2 | 2.0 | 1.7 | 32.0
9 | “Washington, DC” | 2.1 | 2.8 | 2.6 | 34.3
10 | “Manchester, UK” | 1.9 | 2.0 | 1.1 | 30.8
11 | “Atlanta” | 1.7 | 2.1 | 2.1 | 46.2
12 | “San Diego” | 1.5 | 1.5 | 1.1 | 26.3
13 | “Toronto, Canada” | 1.3 | 1.1 | 1.5 | 42.9
14 | “Seattle” | 1.3 | 1.4 | 1.2 | 58.8
15 | “Houston” | 1.2 | 1.2 | 1.0 | 40.0
16 | “Dallas, Texas” | 1.2 | 1.0 | 1.4 | 61.5
17 | “Rio de Janeiro” | 1.2 | 1.0 | 1.1 | 30.8
18 | “Boston, MA” | 1.2 | 1.2 | 1.1 | 20.0
19 | “Amsterdam” | 1.1 | 1.1 | 0.9 | 50.0
20 | “Jakarta, Indonesia” | 1.1 | 0.6 | 0.3 | 42.9
21 | “Austin, TX” | 1.0 | 1.0 | 1.3 | 50.0
22 | “Sydney” | 0.9 | 1.0 | 0.8 | 38.5
23 | “Orlando, Forida” | 0.9 | 1.0 | 0.6 | 16.7
24 | “Phoenix, AZ” | 0.8 | 0.7 | 0.6 | 11.1
25 | “ ” (Hyōgo) [g] | 0.8 | 1.0 | 1.0 | 25.0
[a] Each cluster is labeled with the name most frequently used for locations assigned to the cluster.
[b] Out of the 2167 egos located with precision of <25,000 km².
[c] Out of the 1259 egos included in dyads with both parties located with precision of <25,000 km².
[d] Out of the 1259 alters included in dyads with both parties located with precision of <25,000 km².
[e] Defined as the share of local ties among all ties for egos in a cluster.
[f] Centered between Philadelphia and Trenton, NJ and includes all locations identified as just “New Jersey”.
[g] Centered near the boundary between Hyōgo and Osaka prefectures, in the Kansai region of Japan.
over half of the egos are in other countries, as are 4 of the 10
largest clusters: Tokyo, São Paulo, and two clusters in the United
43. www.bgoncalves.com@bgoncalves
Population Heterogeneity Social Networks 34, 82 (2012)
• Bernoulli process to generate the adjacency matrix given a distance matrix between nodes
• Above some density threshold, the network is naturally connected.
Fig. 3. Emergence of local connectivity on an uneven population density surface. Where the threshold population density for an approximately uniform region of …
$$P(A = a \mid D) = \prod_{\{i,j\}} B\left(A_{ij} = a_{ij} \mid F(D_{ij}, \theta)\right)$$
where B is the Bernoulli probability mass function and F(D_ij, θ) is the tie probability at distance D_ij.
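The Bernoulli process above can be sketched in a few lines: every unordered pair {i, j} receives a tie independently with probability F(D_ij). This is a minimal sketch, not the authors' implementation; the function name, the pure-Python list-of-lists matrices and the `sif` argument (any callable mapping a distance to a probability) are assumptions.

```python
import random


def sample_spatial_graph(dist, sif, rng=None):
    """Draw a symmetric 0/1 adjacency matrix from a distance matrix.

    dist: square matrix (list of lists) of pairwise distances.
    sif:  function mapping a distance to a tie probability in [0, 1].
    rng:  optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    n = len(dist)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # One independent Bernoulli trial per unordered pair {i, j}.
            if rng.random() < sif(dist[i][j]):
                adj[i][j] = adj[j][i] = 1
    return adj


# Toy distance matrix for three nodes on a line (arbitrary units).
toy_dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
toy_graph = sample_spatial_graph(toy_dist, lambda d: 1.0 / (1.0 + d),
                                 rng=random.Random(42))
```

Passing the tie-probability function as an argument keeps the sampler independent of any particular functional form for F.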
44. www.bgoncalves.com@bgoncalves
Vertex Placement Social Networks 34, 82 (2012)
88 C.T. Butts et al. / Social Networks 34 (2012) 82–100
Fig. 5. Comparison of uniform and quasi-random vertex placement, Quay County, NM MSA. Lines indicate census block boundaries, with artificial elevation shown via vertex
color. Insets provide detail of 2 km × 2 km portion of Tucumcari, NM. (For interpretation of the references to color in this figure legend, the reader is referred to the web
version of the article.)
45. www.bgoncalves.com@bgoncalves
Friendship probability Social Networks 34, 82 (2012)
• Probability that two people are friends as a function of distance d:
$$F(d) = \frac{\theta_1}{(1 + \theta_2 d)^{\theta_3}}$$
• with (θ1, θ2, θ3) = (0.533, 0.032, 2.788) for “social friendships” and (0.859, 0.035, 6.437) for “face-to-face interactions”.
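To get a feel for how quickly the two parameter sets decay, the function can be evaluated directly. This is a sketch only: the slide does not state the distance unit, so d is assumed to be in whatever unit the parameters were fitted in, and the names below are illustrative.

```python
def friendship_probability(d, theta1, theta2, theta3):
    """Tie probability F(d) = theta1 / (1 + theta2 * d) ** theta3."""
    return theta1 / (1.0 + theta2 * d) ** theta3


# Parameter triples (theta1, theta2, theta3) as quoted on the slide.
SOCIAL_FRIENDSHIP = (0.533, 0.032, 2.788)
FACE_TO_FACE = (0.859, 0.035, 6.437)

# Tabulate both curves at a few distances (unit assumed, see above).
for d in (0, 10, 100, 1000):
    print(d,
          friendship_probability(d, *SOCIAL_FRIENDSHIP),
          friendship_probability(d, *FACE_TO_FACE))
```

At d = 0 each curve equals its θ1, and the larger exponent θ3 makes the face-to-face curve fall off faster with distance.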
46. www.bgoncalves.com@bgoncalves
Social Network Properties Social Networks 34, 82 (2012)
Fig. 12. Marginal degree distributions by location, SIF, and placement model. Friendship model distributions are shown in blue, interaction model distributions in black;
solid lines indicate uniform placement, with quasi-random placement in dotted lines. (For interpretation of the references to color in this figure legend, the reader is referred
to the web version of the article.)
47. www.bgoncalves.com@bgoncalves
Co-occurrences and Social Ties PNAS 107, 22436 (2010)
• Geotagged Flickr photos
• Divide the world into a grid
• Count the number of cells in which two individuals were both observed within a given time interval
randomly selected Flickr users have a 0.0134% chance of having
a social tie, but when two users have multiple spatio-temporal co-
A Model of Spatio-Temp
small number of co-occu
greater probabilities of a
investigation of the und
basic effect is a robust on
models of social netwo
probabilistic model for h
We begin with a simpl
matches the observed d
To formulate the sim
divided into N geograp
There are M people, eac
network consists of M∕
friends chooses to visit
dependently with proba
location(s) is made un
the probability that two
they visit exactly the sam
[Figure: sample time-stamped observations of individuals A and B (for example, A Jan 1 and B Jan 1) placed on a grid of cells of side s.]
Fig. 1. Illustration of how spatio-temporal co-occurrences are counted, for some sample time-stamped observations of individuals A and B. The world is divided into discrete cells of size s × s, and we count the number of cells k in which the two individuals have been observed within a time threshold of t days; in this case, k = 3 when t is 2.
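The counting rule in the caption can be sketched as follows. This is a minimal illustration, assuming observations are given as (latitude, longitude, day) tuples and that cells are formed by dividing coordinates by s and rounding down; the function names and the toy data are hypothetical.

```python
from math import floor


def cell_of(lat, lon, s):
    """Index of the s x s grid cell that contains the point (lat, lon)."""
    return (floor(lat / s), floor(lon / s))


def count_cooccurrences(obs_a, obs_b, s, t):
    """Number of distinct cells where A and B were seen within t days.

    obs_a, obs_b: iterables of (lat, lon, day) observations.
    """
    days_a = {}
    for lat, lon, day in obs_a:
        days_a.setdefault(cell_of(lat, lon, s), []).append(day)
    shared = set()
    for lat, lon, day in obs_b:
        cell = cell_of(lat, lon, s)
        if any(abs(day - d) <= t for d in days_a.get(cell, [])):
            shared.add(cell)
    return len(shared)


# Toy observations on a grid with s = 1 (coordinates and days are made up).
obs_a = [(0.5, 0.5, 1), (1.5, 0.5, 3), (2.5, 2.5, 8)]
obs_b = [(0.2, 0.7, 1), (1.1, 0.9, 2), (2.9, 2.1, 1)]
k_short = count_cooccurrences(obs_a, obs_b, s=1.0, t=2)
k_long = count_cooccurrences(obs_a, obs_b, s=1.0, t=7)
```

Each cell is counted at most once, however many close observations fall inside it, which matches counting cells rather than events.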
48. www.bgoncalves.com@bgoncalves
Co-occurrences and Social Ties PNAS 107, 22436 (2010)
[Figure: probability of friendship vs. number of contemporaneous events (0 to 20), with one curve per time threshold (1 day, 7 days, 14 days, 28 days, 1 year); panels A, B and C, each shown with a linear and a logarithmic vertical axis; slide labels s = 0.001, s = 0.01 and s = 0.1.]
49. www.bgoncalves.com@bgoncalves
Human Mobility Nature 453, 779 (2008)
on should increase with time as rg(t) , t
for an RW, rg(t) ∼ t^(1/2); that is, the longer we
higher the chance that she/he will travel to areas
To check the validity of these predictions, we
dependence of the radius of gyration for users
ius would be considered small (rg(T) ≤ 3 km),
) ≤ 30 km) or large (rg(T) > 100 km) at the end
period (T = 6 months). The results indicate that
is, F(x) ∼ x^(−α) for x < 1 and F(x) rapidly decreases for x ≫ 1.
Therefore, the travel patterns of individual users may be approximated by a Lévy flight up to a distance characterized by rg. Most important, however, is the fact that the individual trajectories are bounded beyond rg; thus, large displacements, which are the source of the distinct and anomalous nature of Lévy flights, are statistically absent. To understand the relationship between the different exponents, we note that the measured probability distributions are related
n mobility patterns. a, Week-long trajectory of 40
ndicates that most individuals travel only over short
egularly move over hundreds of kilometres. b, The
a single user. The different phone towers are shown as
for each location is shown as a vertical bar. The circle represents the radius of
gyration centred in the trajectory’s centre of mass. c, Probability density
function P(Δr) of travel distances obtained for the two studied data sets D1
and D2. The solid line indicates a truncated power law for which the
$$P(\Delta r) = (\Delta r + \Delta r_0)^{-\beta} \exp(-\Delta r/\kappa)$$
Cell Phones
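The radius of gyration mentioned in the caption measures how far a user's recorded positions spread around their centre of mass. A minimal sketch follows, assuming positions are already projected to planar coordinates in km and that every recorded position has equal weight; the function name `radius_of_gyration` is chosen here for illustration.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of the positions from their centre of mass.

    points: non-empty list of (x, y) pairs in km (planar approximation).
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)


# Four visits on the corners of a 2 km square: every point lies sqrt(2) km
# from the centre of mass at (1, 1).
rg = radius_of_gyration([(0, 0), (2, 0), (2, 2), (0, 2)])
print(rg)
```

For real latitude/longitude data the planar distance would have to be replaced by a great-circle distance or a suitable projection.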
50. www.bgoncalves.com@bgoncalves
Human Mobility Nature 453, 779 (2008)
[Figure 3, panels a–d: density maps in x (km) vs. y (km) at three scales (±15, ±150 and ±1,200 km); rescaled maps in x/σx vs. y/σy; σy/σx vs. rg (km); cross-section at y = 0 for rg ≤ 3 km, 20 km < rg < 30 km and rg > 100 km.]
Figure 3 | The shape of human trajectories. a, The probability density function Φ(x, y) of finding a mobile phone user in a location (x, y) in the user’s intrinsic reference frame (see Supplementary Information for details). The three plots, from left to right, were generated for 10,000 users with: rg ≤ 3, 20 < rg ≤ 30 and rg > 100 km. The trajectories become more anisotropic as rg increases. b, After scaling each position with σx and σy, the resulting Φ̃(x/σx, y/σy) has approximately the same shape for each group. c, The change in the shape of Φ(x, y) can be quantified calculating the isotropy ratio S ≡ σy/σx as a function of rg, which decreases as S ∼ rg^(−0.12) (solid line). Error bars represent the standard error. d, Φ̃(x/σx, 0) representing the x-axis cross-section of the rescaled distribution Φ̃(x/σx, y/σy) shown in b.
Cell Phones
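The ratio of the spread along the minor axis to the spread along the major axis, as discussed in the Figure 3 caption, can be sketched for a single trajectory. The sketch assumes that the intrinsic reference frame is the principal-axis frame of the user's positions (the exact construction is in the paper's Supplementary Information and is not reproduced here) and that positions are planar coordinates in km; `anisotropy_ratio` is a name introduced for this example.

```python
import math


def anisotropy_ratio(points):
    """Return the minor-axis spread divided by the major-axis spread.

    Uses the closed-form eigenvalues of the 2x2 covariance matrix of the
    positions. Assumes the points are not all identical.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    mean = (sxx + syy) / 2
    half_gap = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    var_major = mean + half_gap            # variance along the principal axis
    var_minor = max(mean - half_gap, 0.0)  # guard against rounding below zero
    return math.sqrt(var_minor / var_major)


# A cross-shaped trajectory twice as long as it is wide.
print(anisotropy_ratio([(2, 0), (-2, 0), (0, 1), (0, -1)]))
```

Because the eigenvalues of the covariance matrix do not change under rotation, the result is independent of how the trajectory is oriented on the map.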
51. www.bgoncalves.com@bgoncalves
Privacy Sci Rep 3, 1376 (2013)
function fits the data better than other two-parameter functions such as α − exp(λx), a stretched exponential α − exp(x^β), or a standard linear function α − βx (see Table S1). Both estimators for α and β are highly significant (p < 0.001) [32], and the mean pseudo-R² is 0.98 for the Ip=4 case and the Ip=10 case. The fit is good at all levels of spatial and temporal aggregation [Fig. S3A–B].
The power-law dependency of ε means that, on average, each time the spatial or temporal resolution of the traces is divided by two, their uniqueness decreases by a constant factor ∼ (2)^(−β). This implies that privacy is increasingly hard to gain by lowering the resolution of a dataset.
Fig. 2B shows that, as expected, ε increases with p. The mitigating effect of p on ε is mediated by the exponent β which decays linearly with p: β = 0.157 − 0.007p [Fig. 4E]. The dependence of β on p implies that a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution. In fact, given four points, a two-fold decrease in spatial or temporal resolution makes it 9.3% less likely to identify an individual, while given ten points, the same two-fold decrease results in a reduction of only 6.2% (see Table S1).
Because of the functional dependency of ε on p through the exponent β, mobility datasets are likely to be re-identifiable using information on only a few outside locations.
Discussion
Our ability to generalize these results to other mobility datasets depends on the sensitivity of our analysis to extensions of the data to larger populations, or geographies. An increase in population density will tend to decrease ε. Yet, it will also be accompanied by an increase in the number of antennas, businesses or WiFi hotspots used for localizations. These effects run opposite to each other, and therefore, suggest that our results should generalize to higher population densities.
Extensions of the geographical range of observation are also unlikely to affect the results as human mobility is known to be highly circumscribed. In fact, 94% of the individuals move within an average radius of less than 100 km [17]. This implies that geographical extensions of the dataset will stay locally equivalent to our observations, making the results robust to changes in geographical range.
From an inference perspective, it is worth noticing that the spatio-temporal points do not equally increase the likelihood of uniquely identifying a trace. Furthermore, the information added by a point is highly dependent on the points already known. The amount of information gained by knowing one more point can be defined as the reduction of the cardinality of S(Ip) associated with this extra point. The larger the decrease, the more useful the piece of information is. Intuitively, a point on the MIT campus at 3AM is more likely to make a trace unique than a point in downtown Boston on a Friday evening.
This study is likely to underestimate ε, and therefore the ease of re-identification, as the spatio-temporal points are drawn at random from users’ mobility traces. Our Ip are thus subject to the user’s spatial and temporal distributions. Spatially, it has been shown that the uncertainty of a typical user’s whereabouts measured by its
Figure 2 | (A) Ip=2 means that the information available to the attacker consists of two 7am-8am spatio-temporal points (I and II). In this case, the target was in zone I between 9am and 10am and in zone II between 12pm and 1pm. In this example, the traces of two anonymized users (red and green) are compatible with the constraints defined by Ip=2. The subset S(Ip=2) contains more than one trace and is therefore not unique. However, the green trace would be uniquely characterized if a third point, zone III between 3pm and 4pm, is added (Ip=3). (B) The uniqueness of traces with respect to the number p of given spatio-temporal points (Ip). The green bars represent the fraction of unique traces, i.e. |S(Ip)| = 1. The blue bars represent the fraction of |S(Ip)| ≤ 2. Therefore knowing as few as four spatio-temporal points taken at random (Ip=4) is enough to uniquely characterize 95% of the traces amongst 1.5 M users. (C) Box-plot of the minimum number of spatio-temporal points needed to uniquely characterize every trace on the non-aggregated database. At most eleven points are enough to uniquely characterize all considered traces.
[Figure 3, panels A–C. Axes: probability density function vs. number of interactions; probability density function vs. median inter-interactions time per user [h]; number of antennas vs. inhabitants.]
Figure 3 | (A) Probability density function of the amount of recorded spatio-temporal points per user during a month. (B) Probability density function of the median inter-interaction time with the service. (C) The number of antennas per region is correlated with its population (R² = .6426). These plots strongly emphasize the discrete character of our dataset and its similarities with datasets such as the one collected by smartphone apps.
Cell Phones
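The uniqueness test on this slide (a trace is unique when it is the only one compatible with the p known points, i.e. |S(Ip)| = 1) can be sketched as follows. The data layout (a dict from user id to a set of `(zone, hour)` points), the function names, and the two toy traces are assumptions made for illustration; they only mimic the red/green example of the Figure 2 caption.

```python
import random


def compatible_traces(traces, known_points):
    """Return the ids of all traces that contain every known point: S(Ip)."""
    known = set(known_points)
    return [uid for uid, points in traces.items() if known <= points]


def fraction_unique(traces, p, rng):
    """Draw p points at random from each trace and report the fraction of
    traces that those points single out (|S(Ip)| == 1)."""
    eligible = 0
    unique = 0
    for uid, points in traces.items():
        if len(points) < p:
            continue  # not enough points to draw from
        eligible += 1
        known = rng.sample(sorted(points), p)
        if len(compatible_traces(traces, known)) == 1:
            unique += 1
    return unique / eligible if eligible else 0.0


# Toy traces: both users were in zone I at 9h and zone II at 12h.
traces = {
    "red": {("I", 9), ("II", 12), ("IV", 15)},
    "green": {("I", 9), ("II", 12), ("III", 15)},
}
print(compatible_traces(traces, [("I", 9), ("II", 12)]))
print(compatible_traces(traces, [("I", 9), ("II", 12), ("III", 15)]))
print(fraction_unique(traces, p=3, rng=random.Random(0)))
```

Coarsening the spatial or temporal resolution corresponds to mapping several zones or hours to the same label before building the sets, which can only enlarge S(Ip).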
53. www.bgoncalves.com@bgoncalves
Gravity Law of Commuting PNAS 106, 21484 (2009)
US county commuting network
each node i : subpopulation (census area)
each link (ij) : interaction between subpopulations i and j
weight wij : number of people commuting from i to j per unit time
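The network described above can be held as a mapping from directed links (i, j) to weights wij, and compared against a gravity-style prediction. The slide does not state the functional form, so the sketch below assumes one common choice: power laws in the two populations with an exponential decay in distance. The function name, the parameters `c`, `alpha`, `gamma`, `beta`, and all numbers are hypothetical placeholders, not fitted values from the paper.

```python
import math


def gravity_flow(n_i, n_j, d_ij, c, alpha, gamma, beta):
    """Gravity-style commuting flow from i to j (assumed functional form):
    c * n_i**alpha * n_j**gamma * exp(-d_ij / beta)."""
    return c * n_i ** alpha * n_j ** gamma * math.exp(-d_ij / beta)


# Hypothetical subpopulations, distances (km) and observed flows.
population = {"A": 50_000, "B": 200_000}
distance_km = {("A", "B"): 40.0, ("B", "A"): 40.0}
observed = {("A", "B"): 1_200, ("B", "A"): 300}

for (i, j), w_data in observed.items():
    w_model = gravity_flow(population[i], population[j], distance_km[(i, j)],
                           c=1e-4, alpha=0.5, gamma=0.6, beta=80.0)
    # Ratio of data to model flow for this link.
    print(i, j, w_data / w_model)
```

Using different exponents for origin and destination keeps the model directed, so the predicted flow from A to B need not equal the one from B to A.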
54. www.bgoncalves.com@bgoncalves
Gravity Law of Commuting PNAS 106, 21484 (2009)
[Figure, panels A–F. Axes: w(D)/(NN) vs. distance (km); ratio of data to model flows w(D)/w(M) vs. distance (km), vs. population of origin and vs. population of destination.]
136 D. Balcan et al. / Journal of Computational Science 1 (2010) 132–145
Table 1
Commuting networks in each continent. Number of countries (N), number of administrative units (V) and inter-links between them (E) are summarized.
Continent        N        V          E
Europe          17   65,880  4,490,650
North America    2    6,986    182,255
Latin America    5    4,301    102,117
Asia             4    4,355    380,385
Oceania          2      746     30,679
Total           30   82,268  5,186,186
commuting. This allows us to deal with self-similar units across the world with respect to mobility as emerged from the tessellation and not country-specific administrative boundaries. We have therefore mapped the different levels of commuting data into the geographical census areas formed by the Voronoi-like tessellation procedure described above. The mapped commuting flows can be seen as a second transport network connecting subpopulations that are geographically close. This second network can be overlaid on the WAN in a multi-scale fashion to simulate realistic scenarios for disease spreading. The network exhibits important variability in the number of commuters on each connection as well as in the total number of commuters per geographical census area. Since the census areas are statistically homogeneous, we can also extract a general statistical law that allows for the synthetic generation of commuting networks in countries where real data are not available. A full account of the commuting data obtained across different continents and their statistical analysis can be found in Ref. [2].
3.3. Disease model
Table 2
Transitions between compartments and their rates.
Transition       Type         Rate
Sj → Lj          Contagion    λj
Lj → Ia_j        Spontaneous  εpa
Lj → It_j        Spontaneous  ε(1 − pa)pt
Lj → Int_j       Spontaneous  ε(1 − pa)(1 − pt)
Ia_j → Rj        Spontaneous  μ
It_j → Rj        Spontaneous  μ
Int_j → Rj       Spontaneous  μ
general, the force of infection is assumed to follow the mass action principle for which the infection rate is λ = βI/N where β is the infection transmission rate and I/N is the density of infected individuals in the population. In the case of asymptomatic individuals the force of infection is usually reduced by a factor rβ. In the case of multiple interacting subpopulations and different classes of infectives the force of infection will be the sum of different contributions as reported in Section 4.3.
Given the force of infection λj in subpopulation j, each person in the susceptible compartment (Sj) contracts the infection with probability λj Δt and enters the latent compartment (Lj), where Δt is the time interval considered. Latent individuals exit the compartment with probability εΔt, and transit to asymptomatic infectious compartment (Ia_j) with probability pa or, with the complementary probability 1 − pa, become symptomatic infectious. Infectious persons with symptoms are further divided between those who can travel (It_j), probability pt, and those who are travel-restricted (Int_j) with probability 1 − pt. All the infectious persons permanently recover with probability μΔt, entering the recovered compartment
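The transitions described above can be sketched as one stochastic time step for a single subpopulation. This is an illustration under stated assumptions, not the authors' implementation: the force of infection is passed in as a number (`lam`) rather than computed from Section 4.3, the recovery rate is called `mu`, each individual is drawn independently with a Bernoulli trial, and the function and key names are invented for this example.

```python
import random


def step(state, lam, eps, mu, p_a, p_t, dt, rng):
    """Advance the compartments S, L, Ia, It, Int, R by one interval dt.

    All draws use the counts at the start of the step, so an individual
    makes at most one transition per step.
    """

    def draw(n, prob):
        # Number of successes among n independent trials with probability prob.
        return sum(1 for _ in range(n) if rng.random() < prob)

    new_latent = draw(state["S"], lam * dt)        # S -> L
    leaving_latent = draw(state["L"], eps * dt)    # L -> some infectious class
    to_asym = draw(leaving_latent, p_a)            # L -> Ia
    symptomatic = leaving_latent - to_asym
    to_travel = draw(symptomatic, p_t)             # L -> It
    to_no_travel = symptomatic - to_travel         # L -> Int
    recovered = {k: draw(state[k], mu * dt) for k in ("Ia", "It", "Int")}

    return {
        "S": state["S"] - new_latent,
        "L": state["L"] + new_latent - leaving_latent,
        "Ia": state["Ia"] + to_asym - recovered["Ia"],
        "It": state["It"] + to_travel - recovered["It"],
        "Int": state["Int"] + to_no_travel - recovered["Int"],
        "R": state["R"] + sum(recovered.values()),
    }


state = {"S": 990, "L": 5, "Ia": 2, "It": 2, "Int": 1, "R": 0}
rng = random.Random(42)
for _ in range(10):
    state = step(state, lam=0.05, eps=0.5, mu=0.3, p_a=0.33, p_t=0.5, dt=1.0, rng=rng)
print(state, sum(state.values()))
```

Per-individual draws keep the sketch dependency-free; for large populations a binomial or multinomial sampler would be the usual replacement.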
55. www.bgoncalves.com@bgoncalves
Mobility and Social Networks
Coupling Mobility and Interactions in Social Media
and for their dependence on the distance. The error Err of this null model is between 0.66–0.76 for the three countries, around twice the error of the TF model (see Figure 6).
The linking model (L model) is a simplified version of the TF model, without random mobility and the box size d → 0. Agents move to visit their contacts with probability pv, whereas with probability 1 − pv they do not perform any action. In this version of the model, users can connect only by random connections or when two of them coincide, visiting a common friend, which leads to triadic closure. These two processes do not depend on the distances between the users. A thorough description can be obtained with a mean-field approach (see the corresponding section). The results of the L model are shown in Figure 2. Due to the triangle closing mechanism, this null model creates networks with a considerable level of clustering. However, it does not
(e.g., for the US the TF model has Err lower by 0.5 and 1.5 than the TF-normal and the TF-uniform models, respectively, as shown in Figure 6).
Simplified models that neglect either geography or network structure perform considerably worse than the TF model in reproducing the properties of real networks. Likewise, non-realistic assumptions on human mobility mechanism yield worse results than the default TF model. To conclude, the coupling of geography and structure through a realistic mobility mechanism produces networks with significantly more realistic geographic and structural properties.
Sensitivity of the TF Model to the Parameters and its Modifications
The results presented so far have been obtained at the optimal
Figure 4. Simulation results: mobility and social networks. Mobility (upper row) and ego networks (lower row) of 20 random users (different
colors) for the instances of the TF model yielding the lowest error Err (see Figure 3). Mobility network shows mobility patterns of individual users
throughout entire simulation. Ego network shows the social connections at the end of the simulation.
doi:10.1371/journal.pone.0092196.g004
PLoS One 9, E92196 (2014)
56. www.bgoncalves.com@bgoncalves
Geo-Social Properties
Coupling Mobility and Interactions in Social Media
that has also an edge between i and k, forming a triangle. Note that a triangle consists of 3 triads centered on different nodes. The effect of the distance on the clustering coefficient can be incorporated by measuring the distances from each central node j to two neighbors i and k forming a triad, d = dij + djk, and calculating the network clustering restricted to triads with distance d. This new function C(d) is the probability of closing a triangle given the distance d in a triad
$$C(d) = \frac{\Delta(d)}{\Lambda(d)},$$
where Λ(d) and Δ(d) are the numbers of triads and closed triangles for the distance d, respectively. The value of the global clustering coefficient C can be recovered by averaging C(d) over d. In the datasets, we observe a drop in C(d) followed by a plateau, which is best visible for the US networks (Figure 2E).
Given a triangle, several configurations are possible if there is diversity in the edge lengths. The triangle can be equilateral if all the edges have the same length, isosceles if two have the same length and the other is smaller, etc. We estimate the dominant shapes of the triangles in the network by measuring the disparity D defined as:
$$D = 6\left[\frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3}\right],$$
where d1, d2 and d3 are the geographical distances between the locations of the users forming the triangle. The disparity takes values between 0 and 1 as the shape of the triangle passes from equilateral to isosceles, where one edge is much smaller than the other two. D shows a distribution with two maxima in the online social networks (Figure 2F), for low and high values. The two m
[Figure 2 panels shown: Prob of a Link, Reciprocity, Clustering, Triangle Disparity]
Figure 2. Network geo-social properties. Various statistical network properties are plotted for the data obtained from Twitter (red squares), Gowalla (blue diamonds), Brightkite (green triangles) and the null models (dashed lines), for the US (for the UK and Germany, see Figures S1 and S2). The spatial model (magenta), based on geography, matches well the data in Pl(d), but yields near-zero values for R(d), Jf(d) and C(d). The linking model (cyan), based on triadic closure, produces enough clustering, but it does not reproduce the distance dependencies of Pl(d), R(d), Jf(d) and C(d).
doi:10.1371/journal.pone.0092196.g002
PLoS One 9, E92196 (2014)
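Both measures on this slide, the distance-dependent clustering C(d) and the triangle disparity D, are simple to compute once user positions and links are known. The sketch below assumes an undirected network, planar coordinates, and fixed-width distance bins; the function names, the binning, and the toy network are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def triangle_disparity(d1, d2, d3):
    """Disparity of a triangle with edge lengths d1, d2, d3:
    0 for equilateral, approaching 1 when one edge is much shorter."""
    total = d1 + d2 + d3
    return 6 * ((d1 * d1 + d2 * d2 + d3 * d3) / total ** 2 - 1 / 3)


def clustering_by_distance(edges, pos, bin_width):
    """Fraction of closed triads per bin of triad distance d = d_ij + d_jk."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    triads = defaultdict(int)
    closed = defaultdict(int)
    for j, neighbours in adj.items():
        # Every pair of neighbours (i, k) of the central node j forms a triad.
        for i, k in combinations(sorted(neighbours), 2):
            d = math.dist(pos[i], pos[j]) + math.dist(pos[j], pos[k])
            b = int(d // bin_width)
            triads[b] += 1
            if k in adj[i]:
                closed[b] += 1
    return {b: closed[b] / triads[b] for b in triads}


# Toy network: a small triangle a-b-c plus a distant user d linked to b.
pos = {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (5, 0)}
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("b", "d")]
print(clustering_by_distance(edges, pos, bin_width=3.0))
print(triangle_disparity(1.0, 1.0, 1.0), triangle_disparity(1.0, 1.0, 0.0))
```

In the toy network the three short triads are all closed, while the two triads that reach the distant user are open, which reproduces in miniature the drop of C(d) with distance.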
57. www.bgoncalves.com@bgoncalves
Geo-Social Model
• Starting position of user u
• Visit a random neighbour, or jump to a new location
• New position of u
• Detect all encounters e in the box of u
• Created new social links
PLoS One 9, E92196 (2014)
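The flow on this slide can be sketched as a single update for one user. This is a toy illustration only: it assumes positions are already discretised into boxes on a square grid, that a jump lands on a uniformly random box (the published model uses a realistic jump distribution, not reproduced here), and that `p_v` and `p_c` are the probabilities of visiting an existing friend and of linking to an encountered user. The function name `tf_step` and all data are hypothetical.

```python
import random


def tf_step(u, positions, friends, p_v, p_c, grid_size, rng):
    """Move user u, then link u to users found in the same box."""
    if friends[u] and rng.random() < p_v:
        # Visit a random neighbour: move into that neighbour's box.
        target = rng.choice(sorted(friends[u]))
        positions[u] = positions[target]
    else:
        # Jump to a new location (uniform over the grid; an assumption).
        positions[u] = (rng.randrange(grid_size), rng.randrange(grid_size))

    # Detect all encounters in the box of u and create new social links.
    for other, box in positions.items():
        if other != u and box == positions[u] and other not in friends[u]:
            if rng.random() < p_c:
                friends[u].add(other)
                friends[other].add(u)


positions = {"a": (0, 0), "b": (3, 3), "c": (3, 3)}
friends = {"a": {"b"}, "b": {"a"}, "c": set()}
tf_step("a", positions, friends, p_v=1.0, p_c=1.0, grid_size=10, rng=random.Random(0))
print(positions["a"], sorted(friends["a"]))
```

With `p_v = p_c = 1`, user a visits friend b, meets c in the same box, and gains c as a new contact, which is how visits can generate new links between users who share an acquaintance or a place.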
58. www.bgoncalves.com@bgoncalves
Model Fitting
0.39 for Germany. For simplicity, we focus on the Twitter networks only, although similar results are obtained for the other datasets.
Results
Simulations for the Optimal Parameters
An example of the displacements between consecutive locations and the ego networks for a sample of individuals, as generated by the TF model, is displayed in Figure 4. The parameters of the model are set to the ones that correspond to the minimum of the error Err. As shown, the agents tend to stay close to their original positions. Occasional long jumps occur due to visits to friends who live far apart. In this range of parameters and simulation times, the main mechanism for generating long distance
second null model, the linking model (L model), in contrast, is based only on random linking and triadic closure, and it is equivalent to the TF model without the mobility. We consider the two uncoupled null models and compare their results with those of the TF model. In this way, we demonstrate the importance of the coupling through a realistic mobility mechanism to reproduce the empirical networks.
The spatial model (S model) consists of randomly connecting pairs of users with a probability that decays as a power law of the distance between them (suggested in [41]). The exponent of the power law is fixed at −0.7 following Figure 2A. The results of the S model are shown in the panels of Figure 2. While it is set to match Pl(d), other properties such as P(k), R(d), Jf(d), C(d) or P(D) are not well reproduced. The S model fails to account for the high level of clustering and reciprocity in the empirical networks
Figure 3. Fitting the TF model. Values of the error Err when pv and pc are changed. The minimum error for each of the plots is marked with a red rectangle.
doi:10.1371/journal.pone.0092196.g003
[Figure axes: Prob. to Make a New Friend; Prob. to Visit an Old Friend]
PLoS One 9, E92196 (2014)
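The spatial null model described on this slide (link two users with a probability that decays as a power law of their distance, with exponent −0.7) can be sketched as below. The normalisation `scale`, the minimum distance `d_min` used to avoid a division by zero for co-located users, the undirected links, and the planar coordinates are assumptions added for this example.

```python
import math
import random
from itertools import combinations


def spatial_model(pos, exponent, scale, d_min, rng):
    """Connect each pair of users with probability
    min(1, scale * max(d, d_min) ** exponent)."""
    edges = set()
    for u, v in combinations(sorted(pos), 2):
        d = max(math.dist(pos[u], pos[v]), d_min)
        prob = min(1.0, scale * d ** exponent)
        if rng.random() < prob:
            edges.add((u, v))
    return edges


rng = random.Random(1)
pos = {f"user{i}": (rng.uniform(0, 100), rng.uniform(0, 100)) for i in range(50)}
edges = spatial_model(pos, exponent=-0.7, scale=0.5, d_min=1.0, rng=rng)
print(len(edges))
```

Because every pair is drawn independently, nothing in this construction favours closing triangles, which is consistent with the slide's remark that the S model fails to reproduce the clustering of the empirical networks.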
59. www.bgoncalves.com@bgoncalves
Coupling Mobility and Interactions in Social Media
Model Results
[Figure 5 panels shown: Prob of a Link, Reciprocity, Clustering, Triangle Disparity]
Figure 5. Geo-social properties of the model networks. Various statistical properties are plotted for the networks obtained from Twitter data (red squares) and from simulation of the TF model (black line) for the US. Corresponding results for the UK and Germany can be found in Figures S3 and S4.
doi:10.1371/journal.pone.0092196.g005
random connections, and so the distribution of triangles disparity prevents the model from producing networks with characteristics
PLoS One 9, E92196 (2014)
64. www.bgoncalves.com@bgoncalves
Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015)
[Figure, panels a–d, listed as extracted from the plots.]
a. Axes: Coverage and R̃, for Local and Non-Local users.
b. Axes: Proportion of Non-Local Users and Coverage.
c. Coverage (axis 125 to 155): New York, Chicago, San Francisco, Shanghai, Dallas, Berlin, Paris, Saint Petersburg, Beijing, Moscow.
d. Coverage (axis 325 to 345): Houston, Barcelona, Brussels, Detroit, Lima, Istanbul, Rome, Moscow, Paris, Lisbon.
65. www.bgoncalves.com@bgoncalves
City Communities J. R. Soc. Interface 12, 20150473 (2015)
[Figure: cities ranked by Weighted Betweenness (×10²) and Weighted degree, listed as extracted from the plot: Los Angeles, San Francisco, Miami, Singapore, Tokyo, Paris, London, New York.]
69. www.bgoncalves.com@bgoncalves
Touristic Sites
[Figure, panels (a)–(c), sites listed as extracted from the plots.]
(a) Radius (axis 0.4 to 0.6): Times Square, Niagara Falls, Angkor Wat, Grand Canyon, Machu Picchu, Giza, Forbidden City, Eiffel Tower, Pisa Tower, Taj Mahal.
(b) Coverage (cell) (axis 80 to 120): Iguazu Falls, Giza, Times Square, Machu Picchu, Forbidden City, Niagara Falls, Eiffel Tower, Taj Mahal, Grand Canyon, Pisa Tower.
(c) Coverage (country) (axis 20 to 32): London Tower, Times Square, Hagia Sophia, Machu Picchu, Angkor Wat, Forbidden City, Pisa Tower, Eiffel Tower, Giza, Taj Mahal.
EPJ Data Science 5, 12 (2016)
71. www.bgoncalves.com@bgoncalves
Discussion
• Online Social Networks generate unprecedented amounts of data on Human
Behavior
• The massification of GPS-enabled devices allows us to observe Geographical
variations
• Human mobility is an intrinsically multi-scale process
• Twitter is a good source of geolocated data, but it has many biases that must be
considered
• Different types of links serve different social and information diffusion functions
• The strength of ties provides important clues to the social structure
• Colocation increases the likelihood of friendship
• Mobility and Social Structure mutually influence each other
• Mobility is a proxy for the centrality of a city or touristic locale
72. www.bgoncalves.com@bgoncalves
References
PLoS One 10, e0115545 (2015)
Sci Rep 3, 1376 (2013)
Social Networks 34, 73 (2012)
Social Networks 34, 82 (2012)
PLoS One 7, e29358 (2012)
ICWSM’11, 375 (2011)
PNAS 107, 22436 (2010)
Nature 453, 779 (2008)
PNAS 104, 7333 (2007)
EPJ Data Science 5, 12 (2016)
J. R. Soc. Interface 12, 20150473 (2015)
PLoS One 9, E92196 (2014)
PLoS One 8, E61981 (2013)
ICWSM’11, 89 (2011)
PNAS 106, 21484 (2009)