www.bgoncalves.com@bgoncalves
Bruno Gonçalves
www.bgoncalves.com
Human Mobility
(with mobile devices)
Mobility
GPS-enabled Smartphones
Social Media
Coding
Anatomy of a Tweet
https://github.com/bmtgoncalves/Mining-Georeferenced-Data
[u'contributors',
u'truncated',
u'text',
u'in_reply_to_status_id',
u'id',
u'favorite_count',
u'source',
u'retweeted',
u'coordinates',
u'entities',
u'in_reply_to_screen_name',
u'in_reply_to_user_id',
u'retweet_count',
u'id_str',
u'favorited',
u'user',
u'geo',
u'in_reply_to_user_id_str',
u'possibly_sensitive',
u'lang',
u'created_at',
u'in_reply_to_status_id_str',
u'place',
u'metadata']
Keys of the user object:
[u'follow_request_sent',
u'profile_use_background_image',
u'default_profile_image',
u'id',
u'profile_background_image_url_https',
u'verified',
u'profile_text_color',
u'profile_image_url_https',
u'profile_sidebar_fill_color',
u'entities',
u'followers_count',
u'profile_sidebar_border_color',
u'id_str',
u'profile_background_color',
u'listed_count',
u'is_translation_enabled',
u'utc_offset',
u'statuses_count',
u'description',
u'friends_count',
u'location',
u'profile_link_color',
u'profile_image_url',
u'following',
u'geo_enabled',
u'profile_banner_url',
u'profile_background_image_url',
u'screen_name',
u'lang',
u'profile_background_tile',
u'favourites_count',
u'name',
u'notifications',
u'url',
u'created_at',
u'contributors_enabled',
u'time_zone',
u'protected',
u'default_profile',
u'is_translator']
Keys of the coordinates object:
[u'type',
u'coordinates']
Keys of the entities object:
[u'symbols',
u'user_mentions',
u'hashtags',
u'urls']
Example source value:
u'<a href="http://foursquare.com" rel="nofollow">foursquare</a>'
Example text value:
u"I'm at Terminal Rodovi\xe1rio de Feira de Santana (Feira de Santana, BA) http://t.co/WirvdHwYMq"
Example entry of the urls list inside entities:
{u'display_url': u'4sq.com/1k5MeYF',
u'expanded_url': u'http://4sq.com/1k5MeYF',
u'indices': [70, 92],
u'url': u'http://t.co/WirvdHwYMq'}
API Basics https://dev.twitter.com/docs
• The twitter module provides the OAuth interface. We just need to provide the right credentials.
• Best to keep the credentials in a dict and parametrize our calls with the dict key. This way we can switch between different accounts easily.
• twitter.Twitter(auth=auth) takes an OAuth instance as argument and returns a Twitter object that we can use to interact with the API
• Twitter methods mimic API structure
• 4 basic types of objects:
• Tweets
• Users
• Entities
• Places
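The credentials-in-a-dict idea can be sketched as follows. This is only an illustration: the account name, the placeholder values and the helper names oauth_args and make_api are assumptions, and the argument order simply mirrors the OAuth(...) call used in the scripts of this deck.

```python
# Hypothetical credential store: the account name and all values are placeholders.
accounts = {
    "social": {
        "api_key": "API_KEY",
        "api_secret": "API_SECRET",
        "token": "TOKEN",
        "token_secret": "TOKEN_SECRET",
    },
}


def oauth_args(account_name, store=accounts):
    """Return (token, token_secret, api_key, api_secret) for one account."""
    app = store[account_name]
    return (app["token"], app["token_secret"], app["api_key"], app["api_secret"])


def make_api(account_name):
    """Build a Twitter API object; requires the third-party `twitter` package."""
    import twitter  # imported lazily so the rest of the sketch runs without it
    auth = twitter.oauth.OAuth(*oauth_args(account_name))
    return twitter.Twitter(auth=auth)


print(oauth_args("social"))
```

Switching accounts then only means changing the key passed to make_api.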
User Timeline https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline
• .statuses.user_timeline() returns a set of tweets posted by a single user
• Important options:
• include_rts='true' to include retweets by this user
• count=200 number of tweets to return in each call
• trim_user='true' to not include the user information (saves bandwidth and processing time)
• max_id=1234 to include only tweets with an id less than or equal to 1234
• Returns at most 200 tweets in each call. Can get all of a user's tweets (up to 3200) with multiple calls using max_id
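The max_id paging logic can be tried offline. In this sketch fetch_page is a hypothetical stand-in for the API call (it is not part of the twitter module) and fake_timeline simulates a user with 500 tweets; only the loop structure is the point.

```python
def fetch_all(fetch_page, page_size=200, limit=3200):
    """Collect a timeline by repeatedly asking for tweets older than the oldest seen.

    fetch_page(count, max_id) is a hypothetical callable standing in for the API;
    it must return tweets (dicts with an "id") ordered from newest to oldest.
    """
    tweets = []
    max_id = None
    while len(tweets) < limit:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break  # nothing older is left
        tweets.extend(page)
        # max_id is inclusive, so subtract 1 to avoid fetching the same tweet twice
        max_id = page[-1]["id"] - 1
    return tweets[:limit]


def fake_timeline(ids):
    """Simulate a timeline that holds the given tweet ids."""
    ordered = sorted(ids, reverse=True)

    def fetch_page(count, max_id=None):
        matching = [i for i in ordered if max_id is None or i <= max_id]
        return [{"id": i} for i in matching[:count]]

    return fetch_page


tweets = fetch_all(fake_timeline(range(1, 501)), page_size=200)
print(len(tweets), tweets[0]["id"], tweets[-1]["id"])
```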
import twitter
from twitter_accounts import accounts  # dict of credentials, keyed by account name

# Pick the account to use and build the authenticated API object
app = accounts["social"]
auth = twitter.oauth.OAuth(app["token"],
                           app["token_secret"],
                           app["api_key"],
                           app["api_secret"])
twitter_api = twitter.Twitter(auth=auth)

screen_name = "bgoncalves"
args = {"count": 200,
        "trim_user": "true",
        "include_rts": "true"}

# First call: the most recent tweets
tweets = twitter_api.statuses.user_timeline(screen_name=screen_name, **args)
tweets_new = tweets

# Keep asking for tweets older than the oldest one retrieved so far
while len(tweets_new) > 0:
    max_id = tweets[-1]["id"] - 1
    tweets_new = twitter_api.statuses.user_timeline(screen_name=screen_name,
                                                    max_id=max_id, **args)
    tweets += tweets_new

print("Found", len(tweets), "tweets")
timeline_twitter.py
Streaming Geocoded data
• The Streaming API provides real-time data, subject to filters
• Use TwitterStream instead of Twitter object (.TwitterStream(auth=twitter_api.auth))
• .statuses.filter(track=q) will return tweets that match the query q in real time
• Returns generator that you can iterate over
• .statuses.filter(locations=bb) will return tweets that occur within the bounding box bb in real time
• bb is a comma-separated pair of lon,lat coordinates: the south-west corner followed by the north-east corner.
• -180,-90,180,90 - World
• -74,40,-73,41 - NYC
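Because the longitude-first order is easy to get wrong, a small helper can build and check the bounding box string. The function names bounding_box and in_box are illustrative assumptions, not part of the twitter module.

```python
def bounding_box(west, south, east, north):
    """Format a bounding box as 'west,south,east,north' (longitude comes first)."""
    if not (-180 <= west <= east <= 180 and -90 <= south <= north <= 90):
        raise ValueError("expected west <= east and south <= north, in degrees")
    return ",".join("%g" % value for value in (west, south, east, north))


def in_box(lon, lat, box):
    """Check whether a (lon, lat) point falls inside a bounding box string."""
    west, south, east, north = (float(value) for value in box.split(","))
    return west <= lon <= east and south <= lat <= north


nyc = bounding_box(-74, 40, -73, 41)
world = bounding_box(-180, -90, 180, 90)
print(nyc, in_box(-73.98, 40.75, nyc))
```

Passing latitude first by mistake places the point far outside the New York box, which the second helper makes easy to catch.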
import gzip
import json

import twitter
from twitter_accounts import accounts  # dict of credentials, keyed by account name

app = accounts["social"]
auth = twitter.oauth.OAuth(app["token"],
                           app["token_secret"],
                           app["api_key"],
                           app["api_secret"])
stream_api = twitter.TwitterStream(auth=auth)

query = "-74,40,-73,41"  # NYC
stream_results = stream_api.statuses.filter(locations=query)

tweet_count = 0
# Append one JSON document per line to a gzip-compressed text file
fp = gzip.open("NYC.json.gz", "at")

for tweet in stream_results:
    try:
        tweet_count += 1
        print(tweet_count, tweet["id"])
        fp.write(json.dumps(tweet) + "\n")
    except KeyError:
        # Stream messages without an "id" (e.g. delete or limit notices) are skipped
        pass
location_twitter.py
GPS Coordinates
World Population
Biases
Market Penetration PLoS One 8, E61981 (2013)
Age Distribution
PLoS One 10, e0115545 (2015)
Demographics ICWSM’11, 375 (2011)
[…] users who we could infer a gender for, based on their name and the list previously described. We do so by comparing the first word of their self-reported name to the gender list. We observe that there exists a match for 64.2% of the users. Moreover, we find a strong bias towards male users: fully 71.8% of the users who we find a name match for had a male name.
[Plot: fraction of joining users who are male (y axis, 0 to 1) against join date (x axis, 2007-01 to 2009-07).]
Figure 3: Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time.
[…] each last name with over 100 individuals in the U.S. during the 2000 Census, the Census releases the distribution of race/ethnicity for that last name. For example, the last name “Myers” was observed to correspond to Caucasians 86% of the time, African-Americans 9.7%, Asians 0.4%, and Hispanics 1.4%.
Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users for whom we could infer the race/ethnicity by comparing the last word of their self-reported name to the U.S. Census last name list. We observed that we found a match for 71.8% of the users. We then determined the distribution of race/ethnicity in each county by taking the race/ethnicity distribution in the Census list, weighted by the frequency of each name occurring in Twitter users in that county.(1) Due to the large amount of ambiguity in the last name-race/ethnicity list (in particular, the last name list is more than 95% predictive for only 18.5% of the users), we are unable to directly compare the Twitter race/ethnicity distribution […]
(1) This is effectively the census.model approach discussed in prior work (Chang et al. 2010).
Figure 2: Per-county over- and underrepresentation of U.S. po[…] […]tation rate of 0.324%, presented in both (a) a normal layout an[…] Blue colors indicate underrepresentation, while red colors repre[…] to the log of the over- or underrepresentation rate. Clear trend[…] overrepresentation of populous counties. Panel (a): Normal representation.
[…] less than 95% predictive (e.g., the name Avery was observed to correspond to male babies only 56.8% of the time; it was […]
Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of (a) Caucasian (non-Hispanic), (b) African-American, (c) Asian or Pacific Islander, and (d) Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling.
Language and Geography PLoS One 8, E61981 (2013)
Spanish
English
Geography
Multilayer Network
Information Layer(s)
Social Layer(s)
Link Function ICWSM’11, 89 (2011)
Agreement Discussion
The Strength of Weak Ties (1973)
• Interviews to find out how individuals found out about job opportunities.
• Mostly from acquaintances or friends of friends
• “It is argued that the degree of overlap of two individuals' social networks varies directly with the strength of their tie to one another”
[Figure excerpt, Physical Review E 76, 066106 (2007): “[…] two possible cases in networks with […]: (a) positively correlated nets and (b) […] width of the line of the links represents […]”]
[Diagram: nodes A, B and C with node statistics k_in = 1, k_out = 2, s_in = 1, s_out = 2; k_in = 2, k_out = 1, s_in = 3, s_out = 1; k_in = 1, k_out = 1, s_in = 1, s_out = 2.]
Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree.
Observing a retweet at node B provides implicit confirmation that information from A appeared in B’s Twitter feed, while a mention of B originating at node A explicitly confirms that A’s message appeared in B’s Twitter feed. This may or may not be noticed by B, therefore mention edges are less reliable indicators of information flow compared to retweet edges.
Retweet and reply/mention information parsed from the text can be ambiguous, as in the case when a tweet is marked as being a ‘retweet’ of multiple people. Rather, we rely on Twitter metadata, which designates users replied to or retweeted by each message. Thus, while the text of a tweet may contain several mentions, we only draw an edge to the user explicitly designated as the mentioned user by the metadata. In so doing, we may miss retweets that do not use the explicit retweet feature and thus are not captured in the metadata. Note that this is separate from our use of mentions as memes (§ 3.1), which we parse from the text of the tweet.
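The node statistics named in the Figure 2 caption (degree k and strength s, incoming and outgoing) can be computed from a weighted, directed edge list. The sketch below uses a made-up edge list with hypothetical node names u1, u2 and u3; it is not the network drawn in the figure, and it assumes each (source, target) pair is listed once with its total weight.

```python
from collections import defaultdict


def node_statistics(edges):
    """Compute in/out degree (k) and in/out strength (s) for every node.

    edges is an iterable of (source, target, weight) tuples, one per directed link.
    """
    stats = defaultdict(lambda: {"k_in": 0, "k_out": 0, "s_in": 0, "s_out": 0})
    for source, target, weight in edges:
        stats[source]["k_out"] += 1        # one more outgoing link
        stats[source]["s_out"] += weight   # weighted out-degree
        stats[target]["k_in"] += 1         # one more incoming link
        stats[target]["s_in"] += weight    # weighted in-degree
    return dict(stats)


# Made-up example: weights count how many times one user retweeted or mentioned another
edges = [("u1", "u2", 1), ("u1", "u3", 1), ("u3", "u2", 2), ("u2", "u1", 1)]
stats = node_statistics(edges)
for node in sorted(stats):
    print(node, stats[node])
```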
[…] for a time sufficient to its […] communication network […] and the calls among them links. […] indicates a particular egocentric network evolution. In order to quantify it, we measure the probability, p(n), that the next communication event of an agent having n social ties will occur via the establishment of a new (n + 1)th link. We calculate these probabilities in the MPC dataset averaging them for users with the same degree k at the end of the observation time. We therefore […]
[Figure caption fragment: […] Panels (a), and (b) show calls within 3 hours between people in the same town in two different time windows. […] network structure, which was recorded by aggregating interactions during 6 months. Node size and colors […] width and color represent weight.]
Neighborhood Overlap PNAS 104, 7333 (2007)
[Figure: panels A and B plot P(k) against degree k and P(w) against link weight w (s) on logarithmic axes; panel C illustrates the overlap O_ij between nodes v_i and v_j for four configurations (O_ij = 0, 1/3, 2/3, 1); panel D plots <O>_w and <O>_b against P_cum(w), P_cum(b). Curves are labelled weight, betweenness and reshuffled.]
Fig. 1. Characterizing the large-scale structure and the tie strengths of the mobile call graph. (A and B) Vertex degree (A) and tie strength distribution (B). Each distribution was fitted with $P(x) = a (x + x_0)^{-\gamma_x} \exp(-x/x_c)$, shown as a blue curve, where x corresponds to either k or w. The parameter values for the fits […]
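The overlap values shown on this slide (O_ij = 0, 1/3, 2/3, 1) can be reproduced with a few lines of Python. The sketch assumes the usual definition O_ij = n_ij / ((k_i - 1) + (k_j - 1) - n_ij), where n_ij is the number of common neighbours of v_i and v_j and k is the degree; the formula is not spelled out in the text above, and the function names and example graphs are illustrative.

```python
def build_neighbors(edges):
    """Turn a list of undirected (u, v) links into a dict of neighbour sets."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def overlap(neighbors, i, j):
    """Neighbourhood overlap of the link (i, j); i and j are assumed to be linked."""
    common = len(neighbors[i] & neighbors[j])
    # Each endpoint's degree is reduced by 1 to leave out the link i-j itself
    denominator = (len(neighbors[i]) - 1) + (len(neighbors[j]) - 1) - common
    return common / denominator if denominator else 0.0


no_common = build_neighbors([("i", "j"), ("i", "a"), ("j", "b")])
one_third = build_neighbors([("i", "j"), ("i", "a"), ("i", "b"), ("j", "a"), ("j", "c")])
all_common = build_neighbors([("i", "j"), ("i", "a"), ("i", "b"), ("j", "a"), ("j", "b")])

for graph in (no_common, one_third, all_common):
    print(overlap(graph, "i", "j"))
```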
Strong Ties have higher overlaps PNAS 104, 7333 (2007)
[Figure: the same four panels A to D, with the overlap illustrations O_ij = 1/3 and O_ij = 1 visible; panel D plots <O>_w and <O>_b against P_cum(w), P_cum(b). Curves are labelled Real, Randomized and Betweenness.]
Fig. 1 (continued). The parameter values for the fits […] $k_c = \infty$ (A, degree), and $w_0 = 280$, $\gamma_w = 1.9$, $w_c = 3.45 \times$ […] (C) Illustration of the overlap between two nodes, v_i and v_j, its […] four local network configurations. (D) In the real […] <O>_w (blue circles) increases as a function of cumulative […] representing the fraction of links with tie strength […] hypothesis is tested by randomly permuting the […] the coupling between <O>_w and w (red squares). The […] as a function of cumulative link betweenness centrality […]
Network Structure PLoS One 7, e29358 (2012)
The Strength of Intermediary Ties in Social Media
“People whose networks bridge the structural holes
between groups have an advantage in detecting and
developing rewarding opportunities. Information
arbitrage is their advantage. They are able to see
early, see more broadly, and translate information
across groups.”
Ronald S. Burt (University of Chicago), “Structural Holes and Good Ideas”, AJS Volume 110 Number 2 (September 2004): 349–99.
This article outlines the mechanism by which brokerage provides social capital. Opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with alternative ways of thinking and behaving. Brokerage across the structural holes between groups provides a vision of options otherwise unseen, which is the mechanism by which brokerage becomes social capital. I review evidence consistent with the hypothesis, then look at the networks around managers in a […] American electronics company. The organization is rife with structural holes, and brokerage has its expected correlates. Compensation, positive performance evaluations, promotions, and good ideas […] disproportionately in the hands of people whose networks […] structural holes. The between-group brokers are more likely to express ideas, less likely to have ideas dismissed, and more likely to have ideas evaluated as valuable. I close with implications for creativity and structural change.
The hypothesis in this article is that people who stand near the holes in a social structure are at higher risk of having good ideas. The argument is that opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with […]
The Strength of Intermediary Ties in Social Media
[…] to Granovetter expectation that the stronger the tie is the higher […] number of mutual contacts of both parties it has and the higher the […] belong to the same group.
[…] between which they take place should be small according to Granovetter’s theory. The results show that the most likely to attract retweets are the links connecting groups that are neither too close nor too far. This can be explained with Aral’s theory about the trade-off between diversity and bandwidth: if the two groups are too close there is not enough diversity in the information, while if the groups are too far the communication is poor. These trends are not dependent on the size of the considered groups (see Figs. S6, S7, S8, S9, S10, S11, S12, S13, S14 and Table S1 in the Supplementary Information).
Figure 2. Group and link statistics. (A) Size distribution of the group. (B) Distribution of the number of groups to which each user is assigned. (C) Percentage of links of different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), staying in particular topological localizations in respect to detected groups. doi:10.1371/journal.pone.0029358.g002
Groups PLoS One 7, e29358 (2012)
Figure 4. Group-group activity. (A) Distribution of the number of links in the follower network between groups as a function of the […] groups. (B) Fractions f of links of the different types (follower, with mentions and with retweets) as a function of the size of the group […]
2.4 Links between groups
The next question to consider is the characteristics of links between groups. These links occur mainly between groups containing less than 200 users (Figure 4A–C). However, their frequency depends on the quality of the links (if they bear mentions or retweets). While links with mentions are less abundant than the baseline, those with retweets are slightly more abundant. According to the strength of weak ties theory [12,14–16], weak links are typically connections between persons no[…] neighbors, being important to keep the network connected […] for information diffusion. We investigate whether […] between groups play a similar role in the online n[…] information transmitters. The actions more related to information diffusion are retweets [24] that show a slight preference […] occurring on between-group links (Figures 4B and […]. […] preference is enhanced when the similarity between […] groups is taken into account. We define the similarity between […] groups, A and B, in terms of the Jaccard index […] connections:
$$\mathrm{similarity}(A, B) = \frac{|\text{links of } A \cap \text{links of } B|}{|\text{links of } A \cup \text{links of } B|}$$
The similarity is the overlap between the groups’ connections […] it estimates network proximity of the groups. The general […] is that links with mentions more likely occur between close […] and retweets occur between groups with medium […] (Figure 4D). Mentions as personal messages are […] exchanged between users with similar environments […] predicted by the strength of weak ties theory. Links with […] are related to information transfer and the similarity of the […]
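The Jaccard index used for the group similarity is straightforward to compute with Python sets. In this sketch each group is represented by the set of its links; how the paper encodes a link is not stated here, so the integer identifiers are purely illustrative.

```python
def jaccard(links_a, links_b):
    """Jaccard index of two collections: size of intersection over size of union."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative link identifiers for two groups
group_a = {1, 2, 3}
group_b = {2, 3, 4}
similarity = jaccard(group_a, group_b)
print(similarity)  # 2 shared links out of 4 distinct links -> 0.5
```

A value of 1 means the two groups have identical connections, and 0 means they share none.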
Geography
Multilayer Network
Information Layer(s)
Social Layer(s)
Geographical Layer(s)
Twitter Follower Distance Social Networks 34, 73 (2012)
Y. Takhteyev et al. / Social Networks 34 (2012) 73–81
[Figure caption fragment: […] physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, New […] towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on […]]
Locality Social Networks 34, 73 (2012)
Table 5. Top countries.
Columns: (1) share of egos (%) [a]; (2) share of egos (%) for egos in dyads [b]; (3) share of alters (%) [c]; (4) percentage of domestic ties [d]; (5) percentage of domestic ties among non-local ties [d]; (6) following foreign alters / being followed from abroad; (7) country named explicitly (% of egos).
Country       (1)    (2)    (3)    (4)    (5)    (6)    (7)
USA           48.5   45.7   54.5   91.6   89.3   0.3    8.1
Brazil        10.6   12.1   10.5   83.5   72.5   4.9    55.4
UK            7.6    8.3    7.6    50.6   33.3   1.2    45.3
Japan         5.5    6.5    6.3    92.1   86.0   1.4    25.0
Canada        3.7    3.8    2.9    33.3   23.1   1.6    58.5
Australia     2.7    2.7    1.9    50.0   32.0   2.2    69.7
Indonesia     2.6    1.8    1.2    60.0   25.0   7.0    83.3
Germany       2.1    1.8    1.3    62.9   58.8   3.2    58.6
Netherlands   1.4    1.4    1.2    66.7   22.2   1.5    54.3
Mexico        1.2    1.3    0.7    44.0   8.3    7.0    56.7
[a] Out of the 2852 egos located at the level of country or better.
[b] Out of the egos included in 1953 dyads with both parties located at the level of country or better.
[c] Out of the 1953 alters located at the level of country or better.
[d] The number of ties with the ego and the alter in the given country as a share of all ties for egos in that country.
[…] between those two interpretations. We also note that top Twitter clusters intersect only to an extent with Alderson and Beckfield’s (2004) ranking of world cities based on multinational corporations’ branch headquarters. (Of Alderson and Beckfield’s top 25 cities by in-degree or “prestige,” 13 appear in the top 25 Twitter clusters ranked by in-degree centrality, with another 6 appearing in top 100.)
5.3. National borders
Of the ties that were matched to countries, 75 percent connect users in the same country. This prevalence of domestic ties is […]
Table 6. The most common languages. Based on 2852 egos.
Language     % of egos
English      72.5
Portuguese   10.1
Japanese     5.4
Spanish      3.1
Indonesian   1.8
German       1.7
Dutch        1.0
Chinese      0.9
Korean       0.4
Swedish      0.4
[…] accounts, by randomly drawing an account from among those “followed” by each of those egos. We then coded the locations of the alters using the same procedure as we did for the egos, removing those pairs where the alter could not be assigned to a country. In the end, we obtained a sample of 1953 ego-alter pairs with both the ego and the alter assigned to a country, including 1259 pairs with “specific” locations for both parties (Table 1).
4.4. Aggregating nearby locations
Since specific locations vary substantially in precision and since users can often choose between a range of specific names for the same place (e.g., “Palo Alto” vs. “Silicon Valley” vs. “SF Bay”), we aggregated nearby locations within each country, by assigning a set of coordinates (obtained from Google Maps) to each location smaller than 25,000 km2 and then merging nearby locations within each country by replacing their coordinates with a weighted average of the coordinates of the merged locations. This reduced our location descriptions to a set of 386 regional clusters, which are comparable in size to metropolitan areas. We labeled each cluster with the most common name associated with it in our sample. For example, the cluster centered on Manhattan is referred to as “New York.”
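The merging step can be illustrated with a weighted average of coordinates. The excerpt does not say what the weights are, so using the number of users at each location is an assumption, as are the function name and the example coordinates; plain averaging is also only reasonable for locations that are close together and away from the ±180° meridian.

```python
def merge_locations(locations):
    """Replace nearby locations by one weighted-average coordinate pair.

    locations is a list of (lat, lon, weight) tuples; the weight is assumed
    here to be the number of users reporting that location.
    """
    total = sum(weight for _, _, weight in locations)
    lat = sum(lat * weight for lat, _, weight in locations) / total
    lon = sum(lon * weight for _, lon, weight in locations) / total
    return lat, lon


# Two illustrative nearby locations, the first reported by three times as many users
merged = merge_locations([(40.0, -74.0, 3), (41.0, -73.0, 1)])
print(merged)  # (40.25, -73.75)
```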
5. Analysis
In this section we analyze the factors affecting the formation of Twitter ties. We first look at the effect of each variable identified earlier based on theoretical considerations: the actual physical distance, the frequency of air travel, national boundaries, and language differences. In addition to presenting the descriptive statistics demonstrating the effects of each variable and investigating the nature of such effects, we correlated the effects using the Quadratic Assignment Procedure (QAP, Krackhardt, 1987; Butts, 2007). In the last subsection we also examined the relationship between the variables using QAP regression (Double Dekker Semi-partialling MRQAP). All statistical calculations were done using UCINet 6.277 (Borgatti et al., 2002).
For correlation and regression analysis we used networks with nodes representing the 25 largest regional clusters of users (see […]
Table 3. Top clusters.
Columns: (1) share of egos (%) [b]; (2) share of egos (%) for egos in dyads [c]; (3) share of alters (%) [d]; (4) locality [e].
Rank  Cluster [a]             (1)   (2)   (3)    (4)
1     “New York”              8.5   8.3   10.2   54.3
2     “Los Angeles, CA”       5.1   5.6   10.4   53.3
3     “ ” (Tokyo)             4.1   4.8   5.0    62.9
4     “London”                3.6   3.3   4.9    48.8
5     “São Paulo”             3.5   3.0   3.6    78.4
6     “San Francisco”         2.8   2.7   4.1    41.2
7     “New Jersey” [f]        2.5   2.8   2.1    20.0
8     “Chicago”               2.2   2.0   1.7    32.0
9     “Washington, DC”        2.1   2.8   2.6    34.3
10    “Manchester, UK”        1.9   2.0   1.1    30.8
11    “Atlanta”               1.7   2.1   2.1    46.2
12    “San Diego”             1.5   1.5   1.1    26.3
13    “Toronto, Canada”       1.3   1.1   1.5    42.9
14    “Seattle”               1.3   1.4   1.2    58.8
15    “Houston”               1.2   1.2   1.0    40.0
16    “Dallas, Texas”         1.2   1.0   1.4    61.5
17    “Rio de Janeiro”        1.2   1.0   1.1    30.8
18    “Boston, MA”            1.2   1.2   1.1    20.0
19    “Amsterdam”             1.1   1.1   0.9    50.0
20    “Jakarta, Indonesia”    1.1   0.6   0.3    42.9
21    “Austin, TX”            1.0   1.0   1.3    50.0
22    “Sydney”                0.9   1.0   0.8    38.5
23    “Orlando, Florida”      0.9   1.0   0.6    16.7
24    “Phoenix, AZ”           0.8   0.7   0.6    11.1
25    “ ” (Hyōgo) [g]         0.8   1.0   1.0    25.0
[a] Each cluster is labeled with the name most frequently used for locations assigned to the cluster.
[b] Out of the 2167 egos located with precision of <25,000 km2.
[c] Out of the 1259 egos included in dyads with both parties located with precision of <25,000 km2.
[d] Out of the 1259 alters included in dyads with both parties located with precision of <25,000 km2.
[e] Defined as the share of local ties among all ties for egos in a cluster.
[f] Centered between Philadelphia and Trenton, NJ and includes all locations identified as just “New Jersey”.
[g] Centered near the boundary between Hyōgo and Osaka prefectures, in the Kansai region of Japan.
[…] over half of the egos are in other countries, as are 4 of the 10 largest clusters: Tokyo, São Paulo, and two clusters in the United […]
www.bgoncalves.com@bgoncalves www.bgoncalves.com@bgoncalves
World Population
www.bgoncalves.com@bgoncalves
Population Heterogeneity Social Networks 34, 82 (2012)
• Bernoulli process to generate the adjacency matrix given a distance matrix between nodes:
$$P(A = a \mid D) = \prod_{\{i,j\}} B\left(A_{ij} = a_{ij} \mid F(D_{ij}, \theta)\right)$$
• Above some density threshold, the network is naturally connected.
Fig. 3. Emergence of local connectivity on an uneven population density surface.
www.bgoncalves.com@bgoncalves
Vertex Placement Social Networks 34, 82 (2012)
88 C.T. Butts et al. / Social Networks 34 (2012) 82–100
Fig. 5. Comparison of uniform and quasi-random vertex placement, Quay County, NM MSA. Lines indicate census block boundaries, with artificial elevation shown via vertex
color. Insets provide detail of 2 km × 2 km portion of Tucumcari, NM. (For interpretation of the references to color in this figure legend, the reader is referred to the web
version of the article.)
www.bgoncalves.com@bgoncalves
Friendship probability Social Networks 34, 82 (2012)
• Probability that two people are friends as a function of distance:
$$F(d) = \frac{\theta_1}{(1 + \theta_2 d)^{\theta_3}}$$
• with $\theta$ = (0.533, 0.032, 2.788) for “social friendships” and (0.859, 0.035, 6.437) for “face-to-face interactions”.
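The friendship probability above, combined with the Bernoulli process from the Population Heterogeneity slide, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the mapping of each quoted triple onto (θ1, θ2, θ3) in that order is an assumption, and the slide does not state the distance unit, so `d` is simply whatever unit the parameters were fitted in.

```python
import numpy as np

# Parameter triples quoted on the slide; reading them as
# (theta_1, theta_2, theta_3) in that order is an assumption.
THETA_SOCIAL = (0.533, 0.032, 2.788)
THETA_FACE_TO_FACE = (0.859, 0.035, 6.437)


def friendship_probability(d, theta=THETA_SOCIAL):
    """F(d) = theta_1 / (1 + theta_2 * d) ** theta_3."""
    theta_1, theta_2, theta_3 = theta
    d = np.asarray(d, dtype=float)
    return theta_1 / (1.0 + theta_2 * d) ** theta_3


def sample_network(coords, theta=THETA_SOCIAL, seed=None):
    """Draw one undirected network: each pair {i, j} is linked
    independently with probability F(D_ij)."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distance matrix D (planar coordinates assumed).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    prob = friendship_probability(dist, theta)
    # One Bernoulli draw per unordered pair: keep the strict upper triangle.
    upper = np.triu(rng.random(dist.shape) < prob, k=1)
    return (upper | upper.T).astype(int)
```

Sampling only the upper triangle and mirroring it keeps the adjacency matrix symmetric with an empty diagonal, which matches the product over unordered pairs {i, j}.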
www.bgoncalves.com@bgoncalves
Social Network Properties Social Networks 34, 82 (2012)
Fig. 12. Marginal degree distributions by location, SIF, and placement model. Friendship model distributions are shown in blue, interaction model distributions in black;
solid lines indicate uniform placement, with quasi-random placement in dotted lines. (For interpretation of the references to color in this figure legend, the reader is referred
to the web version of the article.)
www.bgoncalves.com@bgoncalves
Co-occurrences and Social Ties PNAS 107, 22436 (2010)
• Geotagged Flickr Photos
• Divide the world into a grid
• Count the number of cells in which two individuals were observed within a given time interval
randomly selected Flickr users have a 0.0134% chance of having
a social tie, but when two users have multiple spatio-temporal co-
A Model of Spatio-Temp
small number of co-occu
greater probabilities of a
investigation of the und
basic effect is a robust on
models of social netwo
probabilistic model for h
We begin with a simpl
matches the observed d
To formulate the sim
divided into N geograp
There are M people, eac
network consists of M∕
friends chooses to visit
dependently with proba
location(s) is made un
the probability that two
they visit exactly the sam
Fig. 1. Illustration of how spatio-temporal co-occurrences are counted, for some sample time-stamped observations of individuals A and B. The world is divided into discrete cells of size s × s, and we count the number of cells k in which the two individuals have been observed within a time threshold of t days; in this case, k = 3 when t is 2.
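The counting rule in the caption can be sketched as follows. This is an illustrative reading, not the paper's implementation: observations are assumed to be `(lat, lon, day)` tuples, cells are obtained by flooring coordinates to multiples of `s`, "within a time threshold of t days" is taken to mean a difference of at most `t`, and the function names are hypothetical.

```python
from collections import defaultdict
from math import floor


def cell_of(lat, lon, s):
    """Index of the s x s grid cell containing (lat, lon)."""
    return (floor(lat / s), floor(lon / s))


def count_cooccurrences(obs_a, obs_b, s, t):
    """Number of cells k in which A and B were both observed
    at most t days apart. Each observation is (lat, lon, day)."""
    days_a = defaultdict(list)
    days_b = defaultdict(list)
    for lat, lon, day in obs_a:
        days_a[cell_of(lat, lon, s)].append(day)
    for lat, lon, day in obs_b:
        days_b[cell_of(lat, lon, s)].append(day)

    k = 0
    for cell, a_days in days_a.items():
        b_days = days_b.get(cell, [])
        # A cell counts once, however many close pairs it contains.
        if any(abs(da - db) <= t for da in a_days for db in b_days):
            k += 1
    return k


obs_a = [(0.5, 0.5, 1), (1.5, 0.5, 3), (2.5, 2.5, 8)]
obs_b = [(0.6, 0.4, 1), (1.2, 0.9, 2), (2.1, 2.9, 1)]
k = count_cooccurrences(obs_a, obs_b, s=1.0, t=2)
```

In this toy input the two users share three cells, but in the third one their visits are seven days apart, so only two cells count when t is 2.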
www.bgoncalves.com@bgoncalves
Co-occurrences and Social Ties PNAS 107, 22436 (2010)
[Figure, panels A, B, C: Probability of friendship vs. # of contemporaneous events (0 to 20), on linear and logarithmic scales, for time thresholds of 1 day, 7 days, 14 days, 28 days and 1 year; cell sizes s = 0.001, s = 0.01 and s = 0.1.]
www.bgoncalves.com@bgoncalves
Human Mobility Nature 453, 779 (2008)
on should increase with time as rg(t) , t
for an RW, rg(t) , t1/2
; that is, the longer we
higher the chance that she/he will travel to areas
To check the validity of these predictions, we
dependence of the radius of gyration for users
ius would be considered small (rg(T) # 3 km),
) # 30 km) or large (rg(T) . 100 km) at the end
period (T 5 6 months). The results indicate that
is, F(x) ∼ x^{−α} for x < 1 and F(x) rapidly decreases for x ≫ 1. Therefore, the travel patterns of individual users may be approximated by a Lévy flight up to a distance characterized by r_g. Most important, however, is the fact that the individual trajectories are bounded beyond r_g; thus, large displacements, which are the source of the distinct and anomalous nature of Lévy flights, are statistically absent. To understand the relationship between the different exponents, we note that the measured probability distributions are related
n mobility patterns. a, Week-long trajectory of 40
ndicates that most individuals travel only over short
egularly move over hundreds of kilometres. b, The
a single user. The different phone towers are shown as
for each location is shown as a vertical bar. The circle represents the radius of gyration centred in the trajectory’s centre of mass. c, Probability density function P(Δr) of travel distances obtained for the two studied data sets D1 and D2. The solid line indicates a truncated power law for which the
$$P(\Delta r) = (\Delta r + \Delta r_0)^{-\beta} \exp(-\Delta r / \kappa)$$
Cell Phones
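The radius of gyration mentioned in the caption is the root-mean-square distance of a user's recorded positions from their centre of mass. A minimal sketch, assuming positions are already projected to planar coordinates in km with one row per recorded position (the function name and input layout are illustrative, not taken from the paper):

```python
import numpy as np


def radius_of_gyration(positions):
    """Root-mean-square distance of the positions from their centre of mass.

    `positions` is an (n, 2) array-like of planar x, y coordinates,
    one row per recorded position of a single user.
    """
    positions = np.asarray(positions, dtype=float)
    centre_of_mass = positions.mean(axis=0)
    squared_dist = ((positions - centre_of_mass) ** 2).sum(axis=1)
    return float(np.sqrt(squared_dist.mean()))


# Two positions 2 km apart: the centre of mass lies halfway,
# so each position is 1 km away from it.
rg = radius_of_gyration([(0.0, 0.0), (2.0, 0.0)])
```

Repeating a position in the input weights it more heavily, so a user who spends most of their time at one tower gets a smaller radius than one who splits time evenly.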
www.bgoncalves.com@bgoncalves
Human Mobility Nature 453, 779 (2008)
Received 19 December 2007; accepted 27 March 2008. 20. Baraba´si, A.-L. The origin of bursts and heavy tails in human dynamics. Nat
Figure 3 | The shape of human trajectories. a, The probability density function Φ(x, y) of finding a mobile phone user in a location (x, y) in the user’s intrinsic reference frame (see Supplementary Information for details). The three plots, from left to right, were generated for 10,000 users with: r_g ≤ 3, 20 < r_g ≤ 30 and r_g > 100 km. The trajectories become more anisotropic as r_g increases. b, After scaling each position with σ_x and σ_y, the resulting Φ̃(x/σ_x, y/σ_y) has approximately the same shape for each group. c, The change in the shape of Φ(x, y) can be quantified calculating the isotropy ratio S ≡ σ_y/σ_x as a function of r_g, which decreases as S ∼ r_g^{−0.12} (solid line). Error bars represent the standard error. d, Φ̃(x/σ_x, 0) representing the x-axis cross-section of the rescaled distribution Φ̃(x/σ_x, y/σ_y) shown in b.
Cell Phones
www.bgoncalves.com@bgoncalves
Privacy Sci Rep 3, 1376 (2013)
function fits the data better than other two-parameter functions such as α − exp(λx), a stretched exponential α − exp(x^β), or a standard linear function α − βx (see Table S1). Both estimators for α and β are highly significant (p < 0.001) [32], and the mean pseudo-R² is 0.98 for the I_{p=4} case and the I_{p=10} case. The fit is good at all levels of spatial and temporal aggregation [Fig. S3A–B].
The power-law dependency of ε means that, on average, each time the spatial or temporal resolution of the traces is divided by two, their uniqueness decreases by a constant factor ∼ (2)^{−β}. This implies that privacy is increasingly hard to gain by lowering the resolution of a dataset.
to larger populations, or geographies. An increase in population
density will tend to decrease e. Yet, it will also be accompanied by
an increase in the number of antennas, businesses or WiFi hotspots
used for localizations. These effects run opposite to each other, and
therefore, suggest that our results should generalize to higher popu-
lation densities.
Figure 2 | (A) I_{p=2} means that the information available to the attacker consists of two 7am-8am spatio-temporal points (I and II). In this case, the target was in zone I between 9am to 10am and in zone II between 12pm to 1pm. In this example, the traces of two anonymized users (red and green) are compatible with the constraints defined by I_{p=2}. The subset S(I_{p=2}) contains more than one trace and is therefore not unique. However, the green trace would be uniquely characterized if a third point, zone III between 3pm and 4pm, is added (I_{p=3}). (B) The uniqueness of traces with respect to the number p of given spatio-temporal points (I_p). The green bars represent the fraction of unique traces, i.e. |S(I_p)| = 1. The blue bars represent the fraction of |S(I_p)| ≤ 2. Therefore knowing as few as four spatio-temporal points taken at random (I_{p=4}) is enough to uniquely characterize 95% of the traces amongst 1.5 M users. (C) Box-plot of the minimum number of spatio-temporal points needed to uniquely characterize every trace on the non-aggregated database. At most eleven points are enough to uniquely characterize all considered traces.
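The uniqueness measure in the caption can be illustrated with a small sketch: draw p points at random from a trace, collect the subset S(I_p) of all traces compatible with those points, and call the trace unique when that subset has exactly one member. The data layout (each trace a set of `(zone, hour)` tuples) and the function name are assumptions made for illustration, not the paper's code.

```python
import random


def fraction_unique(traces, p, seed=0):
    """Fraction of traces for which p randomly drawn points match no other trace.

    `traces` is a list of sets of (zone, hour) points, one set per user.
    Traces with fewer than p points are skipped.
    """
    rng = random.Random(seed)
    unique = 0
    eligible = 0
    for trace in traces:
        if len(trace) < p:
            continue
        eligible += 1
        # I_p: p points known to the attacker, drawn at random from the trace.
        known = set(rng.sample(sorted(trace), p))
        # S(I_p): all traces that contain every known point.
        compatible = sum(1 for other in traces if known <= other)
        if compatible == 1:
            unique += 1
    return unique / eligible if eligible else 0.0


red = {("I", 9), ("II", 12), ("IV", 15)}
green = {("I", 9), ("II", 12), ("III", 15)}
all_points_known = fraction_unique([red, green], p=3)
```

With all three points known, the third point separates the red and green traces, mirroring the example in panel (A); two identical traces can never be told apart, whatever p is.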
Fig. 2B shows that, as expected, ε increases with p. The mitigating effect of p on ε is mediated by the exponent β which decays linearly with p: β = 0.157 − 0.007p [Fig. 4E]. The dependence of β on p implies that a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution. In fact, given four points, a two-fold decrease in spatial or temporal resolution makes it 9.3% less likely to identify an individual, while given ten points, the same two-fold decrease results in a reduction of only 6.2% (see Table S1).
Because of the functional dependency of ε on p through the exponent β, mobility datasets are likely to be re-identifiable using information on only a few outside locations.
Discussion
Our ability to generalize these results to other mobility datasets
depends on the sensitivity of our analysis to extensions of the data
lation densities.
Extensions of the geographical range of observation are also
unlikely to affect the results as human mobility is known to be highly
circumscribed. In fact, 94% of the individuals move within an average
radius of less than 100 km17
. This implies that geographical exten-
sions of the dataset will stay locally equivalent to our observations,
making the results robust to changes in geographical range.
From an inference perspective, it is worth noticing that the spatio-temporal points do not equally increase the likelihood of uniquely identifying a trace. Furthermore, the information added by a point is highly dependent on the points already known. The amount of information gained by knowing one more point can be defined as the reduction of the cardinality of S(I_p) associated with this extra point. The larger the decrease, the more useful the piece of information is. Intuitively, a point on the MIT campus at 3AM is more likely to make a trace unique than a point in downtown Boston on a Friday evening.
This study is likely to underestimate ε, and therefore the ease of re-identification, as the spatio-temporal points are drawn at random from users’ mobility traces. Our I_p are thus subject to the user’s spatial and temporal distributions. Spatially, it has been shown that the uncertainty of a typical user’s whereabouts measured by its
Figure 3 | (A) Probability density function of the amount of recorded spatio-temporal points per user during a month. (B) Probability density function of the median inter-interaction time with the service. (C) The number of antennas per region is correlated with its population (R² = .6426). These plots strongly emphasize the discrete character of our dataset and its similarities with datasets such as the one collected by smartphone apps.
Cell Phones
www.bgoncalves.com@bgoncalves
Privacy Sci Rep 3, 1376 (2013)
[Figure, panels A to E: Uniqueness of traces and Normalized uniqueness of traces as functions of Temporal resolution [h] and Spatial resolution [v], with series for 1 to 13 cells and for 1 to 13 hours; panel E shows β.]
Cell Phones
www.bgoncalves.com@bgoncalves
Gravity Law of Commuting PNAS 106, 21484 (2009)
wij
i
j
US county commuting network
each node i : subpopulation (census area)
each link (ij) : interaction between subpopulations i and j
weight wij : number of people commuting from i to j per unit time
www.bgoncalves.com@bgoncalves
Gravity Law of Commuting PNAS 106, 21484 (2009)
[Figure, panels A) to F): w(D)/(NN) and w(D)/w(M) plotted against Distance (km), Population of origin and Population of destination, with logarithmic vertical axes.]
136 D. Balcan et al. / Journal of Computational Science 1 (2010) 132–145
Table 1
Commuting networks in each continent. Number of countries (N), number of admin-
istrative units (V) and inter-links between them (E) are summarized.
Continent N V E
Europe 17 65,880 4,490,650
North America 2 6986 182,255
Latin America 5 4301 102,117
Asia 4 4355 380,385
Oceania 2 746 30,679
Total 30 82,268 5,186,186
commuting. This allows to deal with self-similar units across the
world with respect to mobility as emerged from the tessellation and
not country specific administrative boundaries. We have therefore
mapped the different levels of commuting data into the geographi-
cal census areas formed by the Voronoi-like tessellation procedure
described above. The mapped commuting flows can be seen as a
second transport network connecting subpopulations that are geo-
graphically close. This second network can be overlaid to the WAN
in a multi-scale fashion to simulate realistic scenarios for disease
spreading. The network exhibits important variability in the num-
ber of commuters on each connection as well as in the total number
of commuters per geographical census area. Being the census areas
statistically homogeneous we can also extract a general statistical
law that allows for the synthetic generation of commuting net-
works in countries where real data are not available. A full account
of the commuting data obtained across different continents and
their statistical analysis can be found in Ref. [2].
3.3. Disease model
Table 2
Transitions between compartments and their rates.
Transition        Type         Rate
S_j → L_j         Contagion    λ_j
L_j → I^a_j       Spontaneous  ε p_a
L_j → I^t_j       Spontaneous  ε(1 − p_a) p_t
L_j → I^nt_j      Spontaneous  ε(1 − p_a)(1 − p_t)
I^a_j → R_j       Spontaneous  μ
I^t_j → R_j       Spontaneous  μ
I^nt_j → R_j      Spontaneous  μ
general, the force of infection is assumed to follow the mass action principle for which the infection rate is λ = βI/N where β is the infection transmission rate and I/N is the density of infected individuals in the population. In the case of asymptomatic individuals the force of infection is usually reduced by a factor r_β. In the case of multiple interacting subpopulations and different classes of infectives the force of infection will be the sum of different contributions as reported in Section 4.3.
Given the force of infection λ_j in subpopulation j, each person in the susceptible compartment (S_j) contracts the infection with probability λ_j Δt and enters the latent compartment (L_j), where Δt is the time interval considered. Latent individuals exit the compartment with probability εΔt, and transit to the asymptomatic infectious compartment (I^a_j) with probability p_a or, with the complementary probability 1 − p_a, become symptomatic infectious. Infectious persons with symptoms are further divided between those who can travel (I^t_j), with probability p_t, and those who are travel-restricted (I^nt_j), with probability 1 − p_t. All the infectious persons permanently recover with probability μΔt, entering the recovered compartment
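The transitions described above can be sketched as one stochastic time step for a single, isolated subpopulation. This is an illustration under stated assumptions rather than the published model: the multi-population force of infection of Section 4.3 is not reproduced, the single-population form λ = β(I^t + I^nt + r_β I^a)/N is an assumed simplification, and the function name, dictionary keys and parameter values are hypothetical.

```python
import numpy as np


def epidemic_step(state, beta, r_beta, eps, mu, p_a, p_t, dt, rng):
    """Advance compartments S, L, Ia, It, Int, R by one interval dt."""
    S, L, Ia, It, Int, R = (state[k] for k in ("S", "L", "Ia", "It", "Int", "R"))
    N = S + L + Ia + It + Int + R

    # Assumed single-population force of infection; asymptomatic
    # infectious individuals contribute with the reduced weight r_beta.
    lam = beta * (It + Int + r_beta * Ia) / N

    # S -> L: each susceptible is infected with probability lam * dt.
    new_latent = rng.binomial(S, min(1.0, lam * dt))

    # L -> Ia / It / Int: exit with probability eps * dt, then split.
    leaving = rng.binomial(L, min(1.0, eps * dt))
    to_Ia, to_It, to_Int = rng.multinomial(
        leaving, [p_a, (1 - p_a) * p_t, (1 - p_a) * (1 - p_t)]
    )

    # Ia, It, Int -> R: each infectious person recovers with probability mu * dt.
    p_rec = min(1.0, mu * dt)
    rec_Ia = rng.binomial(Ia, p_rec)
    rec_It = rng.binomial(It, p_rec)
    rec_Int = rng.binomial(Int, p_rec)

    return {
        "S": S - new_latent,
        "L": L + new_latent - leaving,
        "Ia": Ia + to_Ia - rec_Ia,
        "It": It + to_It - rec_It,
        "Int": Int + to_Int - rec_Int,
        "R": R + rec_Ia + rec_It + rec_Int,
    }


# Hypothetical parameter values, chosen only to exercise the function.
rng = np.random.default_rng(42)
start = {"S": 990, "L": 5, "Ia": 2, "It": 2, "Int": 1, "R": 0}
after = epidemic_step(start, beta=0.6, r_beta=0.5, eps=0.5, mu=0.4,
                      p_a=0.33, p_t=0.5, dt=1.0, rng=rng)
```

Drawing every transition from the compartment sizes at the start of the step keeps the total population constant and prevents any compartment from going negative.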
www.bgoncalves.com@bgoncalves
Mobility and Social Networks
Coupling Mobility and Interactions in Social Media
and for their dependence on the distance. The error Err of this null model is between 0.66–0.76 for the three countries, around twice the error of the TF model (see Figure 6).
The linking model (L model) is a simplified version of the TF model, without random mobility and the box size d → 0. Agents move to visit their contacts with probability p_v, whereas with probability 1 − p_v they do not perform any action. In this version of the model, users can connect only by random connections or when two of them coincide, visiting a common friend, which leads to triadic closure. These two processes do not depend on the distances between the users. A thorough description can be obtained with a mean-field approach (see the corresponding section). The results of the L model are shown in Figure 2. Due to the triangle closing mechanism, this null model creates networks with a considerable level of clustering. However, it does not
(e.g., for the US the TF model has Err lower by 0.5 and 1.5 than the TF-normal and the TF-uniform models, respectively, as shown in Figure 6).
Simplified models that neglect either geography or network
structure perform considerably worse than the TF model in
reproducing the properties of real networks. Likewise, non-realistic
assumptions on human mobility mechanism yield worse results
than the default TF model. To conclude, the coupling of
geography and structure through a realistic mobility mechanism
produces networks with significantly more realistic geographic and
structural properties.
Sensitivity of the TF Model to the Parameters and its
Modifications
The results presented so far have been obtained at the optimal
Figure 4. Simulation results: mobility and social networks. Mobility (upper row) and ego networks (lower row) of 20 random users (different
colors) for the instances of the TF model yielding the lowest error Err (see Figure 3). Mobility network shows mobility patterns of individual users
throughout entire simulation. Ego network shows the social connections at the end of the simulation.
doi:10.1371/journal.pone.0092196.g004
PLoS One 9, E92196 (2014)
www.bgoncalves.com@bgoncalves
Geo-Social Properties
Coupling Mobility and Interactions in Social Media
that has also an edge between i and k, forming a triangle. Note that a triangle consists of 3 triads centered on different nodes. The effect of the distance on the clustering coefficient can be incorporated by measuring the distances from each central node j to two neighbors i and k forming a triad, d = d_ij + d_jk, and calculating the network clustering restricted to triads with distance d. This new function C(d) is the probability of closing a triangle given the distance d in a triad
$$C(d) = \frac{\Delta(d)}{\Lambda(d)},$$
where Λ(d) and Δ(d) are the numbers of triads and closed triads for the distance d, respectively. The value of the global clustering coefficient C can be recovered by averaging C(d) over d. In […] datasets, we observe a drop in C(d) followed by a plateau, which is best visible for the US networks (Figure 2E).
Given a triangle, several configurations are possible if there is diversity in the edge lengths. The triangle can be equilateral if all the edges have the same length, isosceles if two have the same length and the other is smaller, etc. We estimate the dominant shapes of the triangles in the network by measuring the disparity D defined as:
$$D = 6\left[\frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3}\right],$$
where d_1, d_2 and d_3 are the geographical distances between the locations of the users forming the triangle. The disparity takes values between 0 and 1 as the shape of the triangle passes from equilateral to isosceles, where one edge is much smaller than the other two. D shows a distribution with two maxima in the online social networks (Figure 2F), for low and high values. The two m[…]
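Both quantities defined on this slide are easy to compute once triads have been enumerated. The sketch below is illustrative only: the function names, the `(distance, closed)` input layout and the fixed-width distance bins are assumptions, not part of the paper.

```python
from collections import defaultdict


def triangle_disparity(d1, d2, d3):
    """D = 6 * [(d1^2 + d2^2 + d3^2) / (d1 + d2 + d3)^2 - 1/3]."""
    total = d1 + d2 + d3
    if total <= 0:
        raise ValueError("edge lengths must sum to a positive value")
    return 6.0 * ((d1 * d1 + d2 * d2 + d3 * d3) / total ** 2 - 1.0 / 3.0)


def clustering_by_distance(triads, bin_width):
    """C(d): closed triads divided by all triads, per distance bin.

    `triads` is an iterable of (d, closed) pairs, where d = d_ij + d_jk
    for a triad centred on j and `closed` says whether i and k are linked.
    Returns {bin lower edge: fraction of closed triads}.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [all triads, closed triads]
    for d, closed in triads:
        b = int(d // bin_width)
        counts[b][0] += 1
        counts[b][1] += int(closed)
    return {b * bin_width: c / n for b, (n, c) in sorted(counts.items())}


equilateral = triangle_disparity(1.0, 1.0, 1.0)
degenerate = triangle_disparity(1.0, 1.0, 0.0)
c_of_d = clustering_by_distance([(5, True), (7, False), (15, False)], bin_width=10)
```

The two extremes check the normalisation: three equal edges give D = 0, and an isosceles triangle whose short edge shrinks to zero gives D = 1.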
Coupling Mobility and Interactions in Social Media
Panels: Triangle Disparity, Reciprocity, Prob of a Link, Clustering
Figure 2. Network geo-social properties. Various statistical network properties are plotted for the data obtained from Twitter (red squares), Gowalla (blue diamonds), Brightkite (green triangles) and the null models (dashed lines), for the US (for the UK and Germany, see Figures S1 and S2). The spatial model (magenta), based on geography, matches well the data in P_l(d), but yields near-zero values for R(d), J_f(d) and C(d). The linking model (cyan), based on triadic closure, produces enough clustering, but it does not reproduce the distance dependencies of P_l(d), R(d), J_f(d) and C(d).
doi:10.1371/journal.pone.0092196.g002
PLoS One 9, E92196 (2014)
www.bgoncalves.com@bgoncalves
Geo-Social Model
New position of u
Detect all encounters e in the box of u
Visit a random neighbour
Jump to a new location
Starting position of user u
Created new social links
PLoS One 9, E92196 (2014)
www.bgoncalves.com@bgoncalves
Model Fitting
0.39 for Germany. For simplicity, we focus on the Twitter networks only, although similar results are obtained for the other datasets.
Results
Simulations for the Optimal Parameters
An example with the displacements between the consecutive locations and the ego networks for a sample of individuals, as generated by the TF model, is displayed in Figure 4. The parameters of the model are set to the ones that correspond to the minimum of the error Err. As shown, the agents tend to stay close to their original positions. Occasional long jumps occur due to visits to friends who live far apart. In this range of parameters and simulation times, the main mechanism for generating long distance
second null model, the linking model (L model), in contrast, is based only on random linking and triadic closure, and it is equivalent to the TF model without the mobility. We consider the two uncoupled null models and compare their results with those of the TF model. In this way, we demonstrate the importance of the coupling through a realistic mobility mechanism to reproduce the empirical networks.
The spatial model (S model) consists of randomly connecting pairs of users with a probability that decays as a power law of the distance between them (suggested in [41]). The exponent of the power law is fixed at −0.7 following Figure 2A. The results of the S model are shown in the panels of Figure 2. While it is set to match P_l(d), other properties such as P(k), R(d), J_f(d), C(d) or P(D) are not well reproduced. The S model fails to account for the high level of clustering and reciprocity in the empirical networks
Figure 3. Fitting the TF model. Values of the error Err when p_v and p_c are changed. The minimum error for each of the plots is marked with a red rectangle.
doi:10.1371/journal.pone.0092196.g003
Axes: Prob. to Make a New Friend, Prob. to Visit an Old Friend
PLoS One 9, E92196 (2014)
www.bgoncalves.com@bgoncalves
Coupling Mobility and Interactions in Social Media
Model Results
Panels: Reciprocity, Clustering, Triangle Disparity, Prob of a Link
random connections, and so the distribution of triangles disparity prevents the model from producing networks with characteristics
Figure 5. Geo-social properties of the model networks. Various statistical properties are plotted for the networks obtained from Twitter data (red squares) and from simulation of the TF model (black line) for the US. Corresponding results for the UK and Germany can be found in Figures S3 and S4.
doi:10.1371/journal.pone.0092196.g005
PLoS One 9, E92196 (2014)
www.bgoncalves.com@bgoncalves
Human Diffusion J. R. Soc. Interface 12, 20150473 (2015)
www.bgoncalves.com@bgoncalves
Human Diffusion J. R. Soc. Interface 12, 20150473 (2015)
Starting from Paris
Starting from New York
a
b
www.bgoncalves.com@bgoncalves
Human Diffusion J. R. Soc. Interface 12, 20150473 (2015)
Starting from New Yorkb
www.bgoncalves.com@bgoncalves
Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015)
www.bgoncalves.com@bgoncalves
Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015)
[Figure, panel a: Coverage and R̃ for Local and Non-Local users. Panel b: Coverage vs. Proportion of Non-Local Users. Panel c: Coverage (axis 125 to 155) for New York, Chicago, San Francisco, Shanghai, Dallas, Berlin, Paris, Saint Petersburg, Beijing, Moscow. Panel d: Coverage (axis 325 to 345) for Houston, Barcelona, Brussels, Detroit, Lima, Istanbul, Rome, Moscow, Paris, Lisbon.]
www.bgoncalves.com@bgoncalves
City Communities J. R. Soc. Interface 12, 20150473 (2015)
[Figure: Weighted Betweenness (× 10²) and Weighted degree for Los Angeles, San Francisco, Miami, Singapore, Tokyo, Paris, London, New York.]
www.bgoncalves.com@bgoncalves
Angkor Wat
Forbidden City
Corcovado
Eiffel Tower
Giza
Golden Pavilion
Grand Canyon
Hagia Sophia
Iguazu Falls
Kukulcan
London Tower
Machu Picchu
Mount Fuji
Niagara Falls
Taj Mahal
Pisa Tower
Times Square
Zocalo
Saint Basil's Cathedral
Alhambra
www.bgoncalves.com@bgoncalves
Tourism EPJ Data Science 5, 12 (2016)
www.bgoncalves.com@bgoncalves
Touristic Sites
[Figure, panel (a): Radius (axis 0.4 to 0.6) for Times Square, Niagara Falls, Angkor Wat, Grand Canyon, Machu Picchu, Giza, Forbidden City, Eiffel Tower, Pisa Tower, Taj Mahal. Panel (b): Coverage (cell) (axis 80 to 120) for Iguazu Falls, Giza, Times Square, Machu Picchu, Forbidden City, Niagara Falls, Eiffel Tower, Taj Mahal, Grand Canyon, Pisa Tower. Panel (c): Coverage (country) (axis 20 to 32) for London Tower, Times Square, Hagia Sophia, Machu Picchu, Angkor Wat, Forbidden City, Pisa Tower, Eiffel Tower, Giza, Taj Mahal.]
EPJ Data Science 5, 12 (2016)
Discussion
• Online Social Networks generate unprecedented amounts of data on human behavior
• The widespread adoption of GPS-enabled devices allows us to observe geographical variations
• Human mobility is an intrinsically multi-scale process
• Twitter is a good source of geolocated data, but it has many biases that must be considered
• Different types of links serve different social and information diffusion functions
• The strength of ties provides important clues to the social structure
• Colocation increases the likelihood of friendship
• Mobility and social structure mutually influence each other
• Mobility is a proxy for the centrality of a city or touristic locale
References
PLoS One 10, e0115545 (2015)
Sci Rep 3, 1376 (2013)
Social Networks 34, 73 (2012)
Social Networks 34, 82 (2012)
PLoS One 7, e29358 (2012)
ICWSM’11, 375 (2011)
PNAS 107, 22436 (2010)
Nature 453, 779 (2008)
PNAS 104, 7333 (2007)
EPJ Data Science 5, 12 (2016)
J. R. Soc. Interface 12, 20150473 (2015)
PLoS One 9, e92196 (2014)
PLoS One 8, e61981 (2013)
ICWSM’11, 89 (2011)
PNAS 106, 21484 (2009)
CompleNet 2017
Dubrovnik, Croatia, March/April
Jun 20-22

Human Mobility (with Mobile Devices)

  • 8. Anatomy of a Tweet https://github.com/bmtgoncalves/Mining-Georeferenced-Data
  • 9. Anatomy of a Tweet. Keys of a tweet object: [u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'favorite_count', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'lang', u'created_at', u'in_reply_to_status_id_str', u'place', u'metadata']
  • 10. Anatomy of a Tweet. Keys of the u'user' object: [u'follow_request_sent', u'profile_use_background_image', u'default_profile_image', u'id', u'profile_background_image_url_https', u'verified', u'profile_text_color', u'profile_image_url_https', u'profile_sidebar_fill_color', u'entities', u'followers_count', u'profile_sidebar_border_color', u'id_str', u'profile_background_color', u'listed_count', u'is_translation_enabled', u'utc_offset', u'statuses_count', u'description', u'friends_count', u'location', u'profile_link_color', u'profile_image_url', u'following', u'geo_enabled', u'profile_banner_url', u'profile_background_image_url', u'screen_name', u'lang', u'profile_background_tile', u'favourites_count', u'name', u'notifications', u'url', u'created_at', u'contributors_enabled', u'time_zone', u'protected', u'default_profile', u'is_translator']
  • 11. Anatomy of a Tweet. Selected fields of the tweet object:
      • u'coordinates' (keys): [u'type', u'coordinates']
      • u'entities' (keys): [u'symbols', u'user_mentions', u'hashtags', u'urls']
      • u'source': u'<a href="http://foursquare.com" rel="nofollow">foursquare</a>'
      • u'text': u"I'm at Terminal Rodovi\xe1rio de Feira de Santana (Feira de Santana, BA) http://t.co/WirvdHwYMq"
      • an entry of u'urls': {u'display_url': u'4sq.com/1k5MeYF', u'expanded_url': u'http://4sq.com/1k5MeYF', u'indices': [70, 92], u'url': u'http://t.co/WirvdHwYMq'}
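As a small illustration of the structure above, the following sketch pulls the position out of a tweet that has already been parsed into a Python dict. The `sample_tweet` is a hand-made stand-in that reuses only key names listed on the slides; its values (including the coordinates) are invented for the example, and `tweet_position` is a hypothetical helper, not part of any library. It assumes the GeoJSON convention for the `coordinates` field, in which a Point is stored as [longitude, latitude].

```python
def tweet_position(tweet):
    """Return (lat, lon) for a geolocated tweet, or None if it has no point."""
    point = tweet.get("coordinates")
    if not point or point.get("type") != "Point":
        return None
    lon, lat = point["coordinates"]  # GeoJSON order: longitude first
    return lat, lon


# Hand-made example; the values are illustrative, not real data
sample_tweet = {
    "id": 1,
    "text": "I'm at Terminal Rodoviário de Feira de Santana",
    "coordinates": {"type": "Point", "coordinates": [-38.96, -12.25]},
    "entities": {"symbols": [], "user_mentions": [], "hashtags": [], "urls": []},
}

print(tweet_position(sample_tweet))
print(tweet_position({"id": 2, "coordinates": None}))
```

Returning `None` for tweets without a point keeps the caller in charge of how to treat non-geolocated tweets, which are the majority in most samples.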
  • 12. API Basics https://dev.twitter.com/docs
      • The twitter module provides the OAuth interface. We just need to provide the right credentials.
      • Best to keep the credentials in a dict and parametrize our calls with the dict key. This way we can switch between different accounts easily.
      • .Twitter(auth) takes an OAuth instance as argument and returns a Twitter object that we can use to interact with the API.
      • Twitter methods mimic the API structure.
      • 4 basic types of objects: Tweets, Users, Entities, Places.
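One possible layout for the credentials dict is sketched below. The slides import `accounts` from a `twitter_accounts` module and index it with "social", but the module's contents are not shown, so everything here is an assumption: the placeholder strings stand in for real keys, and `oauth_args` is a hypothetical helper that simply returns the four values in the order in which the slides pass them to `twitter.oauth.OAuth`.

```python
# twitter_accounts.py (hypothetical layout; replace the placeholders with real keys)
accounts = {
    "social": {
        "api_key": "API_KEY",
        "api_secret": "API_SECRET",
        "token": "TOKEN",
        "token_secret": "TOKEN_SECRET",
    },
}


def oauth_args(name):
    """Return the credentials of one account as (token, token_secret, api_key, api_secret)."""
    app = accounts[name]
    return (app["token"], app["token_secret"], app["api_key"], app["api_secret"])


print(oauth_args("social"))
```

Switching accounts then only means changing the key passed to `oauth_args`; the rest of a script stays untouched.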
  • 13. User Timeline https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline
      • .statuses.user_timeline() returns a set of tweets posted by a single user.
      • Important options:
          • include_rts='true' to include retweets by this user
          • count=200 number of tweets to return in each call
          • trim_user='true' to not include the user information (saves bandwidth and processing time)
          • max_id=1234 to include only tweets with an id lower than or equal to 1234
      • Returns at most 200 tweets in each call. We can get all of a user's tweets (up to 3200) with multiple calls using max_id.
  • 14. User Timeline https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline (timeline_twitter.py)

    import twitter
    from twitter_accounts import accounts

    # Credentials of the account stored under the "social" key
    app = accounts["social"]
    auth = twitter.oauth.OAuth(app["token"],
                               app["token_secret"],
                               app["api_key"],
                               app["api_secret"])
    twitter_api = twitter.Twitter(auth=auth)

    screen_name = "bgoncalves"
    args = {"count": 200,
            "trim_user": "true",
            "include_rts": "true"}

    # First call: the most recent tweets
    tweets = twitter_api.statuses.user_timeline(screen_name=screen_name, **args)
    tweets_new = tweets

    # Keep requesting older tweets until a call comes back empty
    while len(tweets_new) > 0:
        max_id = tweets[-1]["id"] - 1
        tweets_new = twitter_api.statuses.user_timeline(screen_name=screen_name,
                                                        max_id=max_id, **args)
        tweets += tweets_new

    print("Found", len(tweets), "tweets")
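The max_id loop can be tried without credentials by swapping the API call for a stand-in. In the sketch below, `fetch_all`, `fake_user_timeline` and `FAKE_TIMELINE` are all hypothetical names introduced for the example; the fake function only imitates the two behaviours the loop relies on, namely that results come newest first and that max_id is inclusive.

```python
def fetch_all(fetch_page, count=200):
    """Collect a whole timeline by walking backwards with max_id."""
    tweets = fetch_page(count=count)
    page = tweets
    while page:
        max_id = tweets[-1]["id"] - 1  # strictly older than the last tweet seen
        page = fetch_page(count=count, max_id=max_id)
        tweets = tweets + page
    return tweets


# Stand-in for the API: 450 fake tweets with ids 450..1, newest first
FAKE_TIMELINE = [{"id": i} for i in range(450, 0, -1)]


def fake_user_timeline(count, max_id=None):
    """Return up to `count` fake tweets with id <= max_id, newest first."""
    older = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return older[:count]


all_tweets = fetch_all(fake_user_timeline)
print("Found", len(all_tweets), "tweets")
```

Subtracting 1 from the last id seen matters: because max_id is inclusive, reusing the id unchanged would return the same tweet again on every call.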
  • 15. Streaming Geocoded data
      • The Streaming API provides realtime data, subject to filters.
      • Use a TwitterStream instead of a Twitter object (.TwitterStream(auth=twitter_api.auth)).
      • .statuses.filter(track=q) will return tweets that match the query q in real time.
      • Returns a generator that you can iterate over.
      • .statuses.filter(locations=bb) will return tweets that occur within the bounding box bb in real time.
      • bb is a comma-separated pair of lon/lat coordinates (southwest corner first, then northeast):
          • -180,-90,180,90 - World
          • -74,40,-73,41 - NYC
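To make the bounding-box format concrete, here is a minimal sketch that builds the string and checks whether a point falls inside it. Both `bbox_query` and `in_bbox` are hypothetical helpers written for this example, and the two test points are rough, illustrative coordinates. The check assumes a box that does not cross the 180th meridian.

```python
def bbox_query(west, south, east, north):
    """Format a bounding box as 'lon,lat,lon,lat', southwest corner first."""
    return ",".join(str(v) for v in (west, south, east, north))


def in_bbox(lon, lat, query):
    """Check whether a point falls inside a box written in the format above."""
    west, south, east, north = (float(v) for v in query.split(","))
    return west <= lon <= east and south <= lat <= north


nyc = bbox_query(-74, 40, -73, 41)
print(nyc)
print(in_bbox(-73.98, 40.75, nyc))  # a point in Manhattan
print(in_bbox(2.35, 48.85, nyc))    # a point in Paris
```

A local check like this can be useful as a second filter on whatever the stream returns.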
  • 16. Streaming Geocoded data (location_twitter.py)

    import gzip
    import json

    import twitter
    from twitter_accounts import accounts

    app = accounts["social"]
    auth = twitter.oauth.OAuth(app["token"],
                               app["token_secret"],
                               app["api_key"],
                               app["api_secret"])
    stream_api = twitter.TwitterStream(auth=auth)

    query = "-74,40,-73,41"  # NYC
    stream_results = stream_api.statuses.filter(locations=query)

    tweet_count = 0
    # Text-mode append, so that one JSON document is written per line
    fp = gzip.open("NYC.json.gz", "at")

    for tweet in stream_results:
        try:
            tweet_count += 1
            print(tweet_count, tweet["id"])
            fp.write(json.dumps(tweet) + "\n")
        except KeyError:
            # Stream messages without an "id" field are not tweets; skip them
            pass
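Once tweets have been collected, they need to be read back for analysis. The sketch below assumes the compressed file holds one JSON document per line; `read_tweets` is a hypothetical helper, and the two records written to a temporary file are made up so that the example runs on its own.

```python
import gzip
import json
import os
import tempfile


def read_tweets(path):
    """Yield one tweet dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt") as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a tiny made-up file so that the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "at") as fp:
    for fake in ({"id": 1, "text": "first"}, {"id": 2, "text": "second"}):
        fp.write(json.dumps(fake) + "\n")

ids = [tweet["id"] for tweet in read_tweets(path)]
print(ids)
```

Reading through a generator keeps memory use flat, which matters when a stream has been running for days.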
  • 25. Demographics ICWSM’11, 375 (2011)
      Excerpts from the paper as cropped on the slide (text cut off by the crop is marked […]):
      • "[…] users who we could infer a gender for, based on their name and the list previously described. We do so by comparing the first word of their self-reported name to the gender list. We observe that there exists a match for 64.2% of the users. Moreover, we find a strong bias towards male users: Fully 71.8% of the users who we find a name match for had a male name."
      • Figure 3: Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time. [Axes: Date, 2007-01 to 2009-07; Fraction of Joining Users who are Male, 0 to 1]
      • "[…] each last name with over 100 individuals in the U.S. […]ing the 2000 Census, the Census releases the distribution of race/ethnicity for that last name. For example, the last name “Myers” was observed to correspond to Caucasians 86% of the time, African-Americans 9.7%, Asians 0.4%, and Hispanics 1.4%."
      • "Race/ethnicity distribution of Twitter users: We first determined the number of U.S.-based users for whom we could infer the race/ethnicity by comparing the last word of their self-reported name to the U.S. Census last name list. We observed that we found a match for 71.8% of the users. We then determined the distribution of race/ethnicity in each county by taking the race/ethnicity distribution in the Census list, weighted by the frequency of each name occurring in Twitter users in that county. Due to the large amount of ambiguity in the last name-race/ethnicity list (in particular, the last name list is more than 95% predictive for only 18.5% of the users), we are […]able to directly compare the Twitter race/ethnicity distr[…]"
      • Footnote 1: "This is effectively the census.model approach discussed in prior work (Chang et al. 2010)."
      • Figure 2: Per-county over- and underrepresentation of U.S. po[…] […]tation rate of 0.324%, presented in both (a) a normal layout an[…] Blue colors indicate underrepresentation, while red colors repre[…] to the log of the over- or underrepresentation rate. Clear trend[…] overrepresentation of populous counties. [Panel: (a) Normal representation]
      • "[…] less than 95% predictive (e.g., the name Avery was observed to correspond to male babies only 56.8% of the time; it was […]"
      • Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling. [Panels: (a) Caucasian (non-hispanic), (b) African-American, (c) Asian or Pacific Islander, (d) Hispanic; color scale from Undersampling to Oversampling]
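The name-matching step described in the excerpt can be sketched in a few lines. This is only an illustration of the idea of comparing the first word of a self-reported name against a gender list: `GENDER_BY_NAME` is a hypothetical three-entry list, the display names are invented, and `infer_gender` is not taken from the paper.

```python
# Hypothetical, tiny name list; the study relies on a much larger one
GENDER_BY_NAME = {"maria": "female", "john": "male", "bruno": "male"}


def infer_gender(display_name):
    """Look up the first word of a self-reported name; None if there is no match."""
    words = display_name.strip().lower().split()
    return GENDER_BY_NAME.get(words[0]) if words else None


names = ["Maria Silva", "john", "xX_gamer_Xx", "Bruno G.", ""]
genders = [infer_gender(n) for n in names]
matched = [g for g in genders if g is not None]

print("match rate:", len(matched) / len(names))
print("male share among matches:", matched.count("male") / len(matched))
```

The two printed ratios correspond to the two kinds of figures quoted in the excerpt: the share of users for whom a match exists, and the share of male names among the matches.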
  • 26. Language and Geography PLoS One 8, e61981 (2013) [Figure labels: Spanish, English, Geography]
  • 29. The Strength of Ties
      • [Cropped figure caption, PHYSICAL REVIEW E 76, 066106 (2007): "[…] two possible cases in networks with […]: (a) positively correlated nets and (b) […] width of the line of the links represents […]"]
      • [Figure: nodes A, B and C, each annotated with one of three sets of values: (kin = 1, kout = 2, sin = 1, sout = 2); (kin = 2, kout = 1, sin = 3, sout = 1); (kin = 1, kout = 1, sin = 1, sout = 2)]
      • Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree.
      • "Observing a retweet at node B provides implicit confirmation that information from A appeared in B’s Twitter feed, while a mention of B originating at node A explicitly confirms that A’s message appeared in B’s Twitter feed. This may or may not be noticed by B, therefore mention edges are less reliable indicators of information flow compared to retweet edges."
      • "Retweet and reply/mention information parsed from the text can be ambiguous, as in the case when a tweet is marked as being a ‘retweet’ of multiple people. Rather, we rely on Twitter metadata, which designates users replied to or retweeted by each message. Thus, while the text of a tweet may contain several mentions, we only draw an edge to the user explicitly designated as the mentioned user by the metadata. In so doing, we may miss retweets that do not use the explicit retweet feature and thus are not captured in the metadata. Note that this is separate from our use of mentions as memes (§ 3.1), which we parse from the text of the tweet."
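The four node statistics in the figure text (in- and out-degree k, in- and out-strength s) follow directly from a weighted, directed edge list. In the sketch below, the node labels X, Y, Z and the edges are assumptions: they were chosen so that the three nodes reproduce the three sets of values quoted in the figure text, but the slide does not say which edges the original figure contains or which label carries which set.

```python
from collections import defaultdict

# (source, target, weight): a hypothetical weighted, directed edge list
edges = [("X", "Y", 1), ("X", "Z", 1), ("Z", "Y", 2), ("Y", "X", 1)]


def node_stats(edges):
    """Return {node: {"k_in", "k_out", "s_in", "s_out"}} for a directed graph."""
    stats = defaultdict(lambda: {"k_in": 0, "k_out": 0, "s_in": 0, "s_out": 0})
    for source, target, weight in edges:
        stats[source]["k_out"] += 1       # one more outgoing edge
        stats[source]["s_out"] += weight  # its weight adds to the out-strength
        stats[target]["k_in"] += 1
        stats[target]["s_in"] += weight
    return dict(stats)


for node, values in sorted(node_stats(edges).items()):
    print(node, values)
```

Degree counts edges while strength sums their weights, so the two only differ on nodes touched by an edge of weight greater than 1.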
  • 30. The Strength of Weak Ties
  • 31. www.bgoncalves.com@bgoncalves The Strength of Weak Ties (1973)
 • Interviews asking how individuals found out about job opportunities.
 • Mostly from acquaintances or friends of friends.
 • “It is argued that the degree of overlap of two individuals’ friendship networks varies directly with the strength of their tie to one another.”
  • 32. www.bgoncalves.com@bgoncalves The Strength of Weak Ties (1973)
 • Interviews asking how individuals found out about job opportunities.
 • Mostly from acquaintances or friends of friends.
 • “It is argued that the degree of overlap of two individuals’ friendship networks varies directly with the strength of their tie to one another.”
[Figure fragment, mobile phone communication network: “… indicates a particular egocentric network evolution. In order to quantify it, we measure the probability, p(n), that the next communication event of an agent having n social ties will occur via the establishment of a new (n + 1)th link. We calculate these probabilities in the MPC dataset averaging them for users with the same degree k at the end of the observation time. …” Caption: “… Panels (a) and (b) show calls within 3 hours between people in the same town in two different time windows. … network structure, which was recorded by aggregating interactions during 6 months. Node size and colors … width and color represent weight.”]
  • 33. www.bgoncalves.com@bgoncalves Neighborhood Overlap PNAS 104, 7333 (2007)
[Figure: (A) degree distribution P(k) versus degree k; (B) link weight distribution P(w) versus link weight w (s); (C) overlap between nodes vi and vj, with examples Oij = 0, 1/3, 2/3 and 1; (D) <O>w and <O>b versus Pcum(w) and Pcum(b). Legend: weight, betweenness, reshuffled.]
Fig. 1. Characterizing the large-scale structure and the tie strengths of the mobile call graph. (A and B) Vertex degree (A) and tie strength distribution (B). Each distribution was fitted with $P(x) = a (x + x_0)^{-\gamma_x} \exp(-x/x_c)$, shown as a blue curve, where x corresponds to either k or w. The parameter values for the fits …
  • 34. www.bgoncalves.com@bgoncalves Strong Ties have higher overlaps PNAS 104, 7333 (2007)
[Figure: (C) overlap examples Oij = 1/3 and Oij = 1; (D) <O>w and <O>b versus Pcum(w) and Pcum(b). Legend: Real, Randomized, Betweenness.]
Fig. 1 (fragments). … $k_c = \infty$ (A, degree), and $w_0 = 280$, $\gamma_w = 1.9$, $w_c = 3.45 \times$ … (C) Illustration of the overlap between two nodes, $v_i$ and $v_j$, its … four local network configurations. (D) In the real … $\langle O \rangle_w$ (blue circles) increases as a function of cumulative …, representing the fraction of links with tie strength … The […]adic hypothesis is tested by randomly permuting the … the coupling between $\langle O \rangle_w$ and $w$ (red squares). The … as a function of cumulative link betweenness centrality …
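The overlap values shown on these two slides (Oij = 0, 1/3, 2/3, 1) can be reproduced with a few lines of code. The sketch below is only an illustration: it assumes the definition from the cited PNAS paper, $O_{ij} = n_{ij} / ((k_i - 1) + (k_j - 1) - n_{ij})$, where $n_{ij}$ is the number of neighbors shared by i and j. The adjacency dictionary and the function name are hypothetical and do not appear in the slides.

```python
def neighborhood_overlap(adj, i, j):
    """Overlap of link (i, j): shared neighbors divided by all
    neighbors of i or j, not counting i and j themselves."""
    if j not in adj[i]:
        raise ValueError("i and j must be connected by a link")
    common = len(adj[i] & adj[j])  # n_ij
    # (k_i - 1) + (k_j - 1) - n_ij: distinct neighbors other than the endpoints
    denom = (len(adj[i]) - 1) + (len(adj[j]) - 1) - common
    return common / denom if denom > 0 else 0.0


# Hypothetical toy graph: i and j share neighbor "a";
# i also knows "b" and j also knows "c".
adj = {
    "i": {"j", "a", "b"},
    "j": {"i", "a", "c"},
    "a": {"i", "j"},
    "b": {"i"},
    "c": {"j"},
}
overlap = neighborhood_overlap(adj, "i", "j")  # one shared neighbor out of three
```

Storing each neighborhood as a set keeps the intersection cheap and makes the degree simply the set size.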
  • 35. www.bgoncalves.com@bgoncalves Network Structure PLoS One 7, e29358 (2012) The Strength of Intermediary Ties in Social Media
“People whose networks bridge the structural holes between groups have an advantage in detecting and developing rewarding opportunities. Information arbitrage is their advantage. They are able to see early, see more broadly, and translate information across groups.”
Ronald S. Burt (University of Chicago), “Structural Holes and Good Ideas”, AJS Volume 110 Number 2 (September 2004): 349–99.
Abstract: “This article outlines the mechanism by which brokerage provides social capital. Opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with alternative ways of thinking and behaving. Brokerage across the structural holes between groups provides a vision of options otherwise unseen, which is the mechanism by which brokerage becomes social capital. I review evidence consistent with the hypothesis, then look at the networks around managers in a […] American electronics company. The organization is rife with structural holes, and brokerage has its expected correlates. Compensation, positive performance evaluations, promotions, and good ideas are disproportionately in the hands of people whose networks […] structural holes. The between-group brokers are more likely to express ideas, less likely to have ideas dismissed, and more likely to have ideas evaluated as valuable. I close with implications for creativity and structural change.”
“The hypothesis in this article is that people who stand near the holes in a social structure are at higher risk of having good ideas. The argument is that opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with a…”
  • 36. www.bgoncalves.com@bgoncalves Network Structure PLoS One 7, e29358 (2012) The Strength of Intermediary Ties in Social Media
“… to Granovetter’s expectation that the stronger the tie is, the higher the number of mutual contacts of both parties it has and the higher the … belong to the same group.”
“… between which they take place should be small according to Granovetter’s theory. The results show that the links most likely to attract retweets are those connecting groups that are neither too close nor too far. This can be explained with Aral’s theory about the trade-off between diversity and bandwidth: if the two groups are too close there is not enough diversity in the information, while if the groups are too far the communication is poor. These trends are not dependent on the size of the considered groups (see Figs. S6–S14 and Table S1 in the Supplementary Information).”
Figure 2. Group and link statistics. (A) Size distribution of the group. (B) Distribution of the number of groups to which each user is assigned. (C) Percentage of links of different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), staying in particular topological localizations in respect to detected groups. doi:10.1371/journal.pone.0029358.g002
  • 37. www.bgoncalves.com@bgoncalves Groups PLoS One 7, e29358 (2012) The Strength of Intermediary Ties in Social Media
Figure 4. Group-group activity. (A) Distribution of the number of links in the follower network between groups as a function of the … groups. (B) Fractions f of links of the different types (follower, with mentions and with retweets) as a function of the size of the group …
2.4 Links between groups
“The next question to consider is the characteristics of links between groups. These links occur mainly between groups containing less than 200 users (Figure 4A–C). However, their frequency depends on the quality of the links (if they bear mentions or retweets). While links with mentions are less abundant than the baseline, those with retweets are slightly more abundant. According to the strength of weak ties theory [12,14–16], weak links are typically connections between persons no… neighbors, being important to keep the network connected and for information diffusion. We investigate whether … between groups play a similar role in the online network as information transmitters. The actions more related to information diffusion are retweets [24] that show a slight preference for occurring on between-group links (Figures 4B and …). … preference is enhanced when the similarity between … groups is taken into account. We define the similarity between … groups, A and B, in terms of the Jaccard index … connections:
$$\mathrm{similarity}(A, B) = \frac{|\text{links of } A \cap \text{links of } B|}{|\text{links of } A \cup \text{links of } B|}$$
The similarity is the overlap between the groups’ connections … it estimates network proximity of the groups. The general … is that links with mentions more likely occur between close … and retweets occur between groups with medium … (Figure 4D). Mentions as personal messages are … exchanged between users with similar environments … predicted by the strength of weak ties theory. Links with … are related to information transfer and the similarity of t…”
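The group similarity defined on this slide is a Jaccard index, which is straightforward to compute once each group's connections are available as a set. The following is a minimal sketch under that assumption; how "links of A" are collected from the follower network is not specified in the transcript, so the input sets and the function name here are hypothetical.

```python
def group_similarity(links_a, links_b):
    """Jaccard index of two groups' connection sets:
    |intersection| / |union|, defined as 0 for two empty sets."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical connection sets for two groups A and B
links_a = {"u1", "u2", "u3", "u4"}
links_b = {"u3", "u4", "u5"}
sim = group_similarity(links_a, links_b)  # 2 shared connections out of 5 distinct ones
```

A value near 1 means the two groups connect to almost the same users (close groups); a value near 0 means their connections barely intersect (distant groups).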
  • 40. www.bgoncalves.com@bgoncalves Twitter Follower Distance Social Networks 34, 73 (2012)
[Figure caption, Y. Takhteyev et al. / Social Networks 34 (2012) 73–81: “… of physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, New … ed towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on …”]
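A histogram of ties by distance in 200 km bins, as described in the caption, can be sketched as follows. This is an illustration only: it assumes great-circle (haversine) distances on a spherical Earth, labels each bin by its lower edge, and uses hypothetical function names and example coordinates; the paper's actual distance computation is not given in the transcript.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def ties_by_distance(pairs, bin_km=200):
    """Count ego-alter ties per distance bin; keys are the lower bin edges in km."""
    counts = Counter()
    for (lat1, lon1), (lat2, lon2) in pairs:
        d = haversine_km(lat1, lon1, lat2, lon2)
        counts[int(d // bin_km) * bin_km] += 1
    return dict(counts)


# Hypothetical ego-alter pair: approximate coordinates of New York and London
bins = ties_by_distance([((40.7128, -74.0060), (51.5074, -0.1278))])
```

With these coordinates the pair is roughly 5,570 km apart, so it is counted in the bin starting at 5400 km.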
  • 41. www.bgoncalves.com@bgoncalves Locality Social Networks 34, 73 (2012)
Y. Takhteyev et al. / Social Networks 34 (2012) 73–81

Table 5. Top countries.
Columns: (1) Share of egos (%) [a]; (2) Share of egos (%) for egos in dyads [b]; (3) Share of alters (%) [c]; (4) Percentage of domestic ties [d]; (5) Percentage of domestic ties among non-local ties [d]; (6) Following foreign alters / being followed from abroad; (7) Country named explicitly (% of egos).
Country       (1)    (2)    (3)    (4)    (5)    (6)   (7)
USA           48.5   45.7   54.5   91.6   89.3   0.3    8.1
Brazil        10.6   12.1   10.5   83.5   72.5   4.9   55.4
UK             7.6    8.3    7.6   50.6   33.3   1.2   45.3
Japan          5.5    6.5    6.3   92.1   86.0   1.4   25.0
Canada         3.7    3.8    2.9   33.3   23.1   1.6   58.5
Australia      2.7    2.7    1.9   50.0   32.0   2.2   69.7
Indonesia      2.6    1.8    1.2   60.0   25.0   7.0   83.3
Germany        2.1    1.8    1.3   62.9   58.8   3.2   58.6
Netherlands    1.4    1.4    1.2   66.7   22.2   1.5   54.3
Mexico         1.2    1.3    0.7   44.0    8.3   7.0   56.7
[a] Out of the 2852 egos located at the level of country or better.
[b] Out of the egos included in 1953 dyads with both parties located at the level of country or better.
[c] Out of the 1953 alters located at the level of country or better.
[d] The number of ties with the ego and the alter in the given country as a share of all ties for egos in that country.

“… between those two interpretations. We also note that top Twitter clusters intersect only to an extent with Alderson and Beckfield’s (2004) ranking of world cities based on multinational corporations’ branch headquarters. (Of Alderson and Beckfield’s top 25 cities by in-degree or “prestige,” 13 appear in the top 25 Twitter clusters ranked by in-degree centrality, with another 6 appearing in top 100.)
5.3. National borders
Of the ties that were matched to countries, 75 percent connect users in the same country. This prevalence of domestic ties is …”

Table 6. The most common languages. Based on 2852 egos.
Language      % of egos
English       72.5
Portuguese    10.1
Japanese       5.4
Spanish        3.1
Indonesian     1.8
German         1.7
Dutch          1.0
Chinese        0.9
Korean         0.4
Swedish        0.4

“… accounts, by randomly drawing an account from among those “followed” by each of those egos. We then coded the locations of the alters using the same procedure as we did for the egos, removing those pairs where the alter could not be assigned to a country. In the end, we obtained a sample of 1953 ego-alter pairs with both the ego and the alter assigned to a country, including 1259 pairs with “specific” locations for both parties (Table 1).
4.4. Aggregating nearby locations
Since specific locations vary substantially in precision and since users can often choose between a range of specific names for the same place (e.g., “Palo Alto” vs. “Silicon Valley” vs. “SF Bay”), we aggregated nearby locations within each country, by assigning a set of coordinates (obtained from Google Maps) to each location smaller than 25,000 km² and then merging nearby locations within each country by replacing their coordinates with a weighted average of the coordinates of the merged locations. This reduced our location descriptions to a set of 386 regional clusters, which are comparable in size to metropolitan areas. We labeled each cluster with the most common name associated with it in our sample. For example, the cluster centered on Manhattan is referred to as “New York.”
5. Analysis
In this section we analyze the factors affecting the formation of Twitter ties. We first look at the effect of each variable identified earlier based on theoretical considerations: the actual physical distance, the frequency of air travel, national boundaries, and language differences. In addition to presenting the descriptive statistics demonstrating the effects of each variable and investigating the nature of such effects, we correlated the effects using the Quadratic Assignment Procedure (QAP, Krackhardt, 1987; Butts, 2007). In the last subsection we also examined the relationship between the variables using QAP regression (Double Dekker Semi-partialling MRQAP). All statistical calculations were done using UCINet 6.277 (Borgatti et al., 2002). For correlation and regression analysis we used networks with nodes representing the 25 largest regional clusters of users (see …”

Table 3. Top clusters.
Columns: (1) Share of egos (%) [b]; (2) Share of egos (%) for egos in dyads [c]; (3) Share of alters (%) [d]; (4) Locality [e].
Rank  Cluster [a]             (1)   (2)   (3)    (4)
 1    “New York”              8.5   8.3   10.2   54.3
 2    “Los Angeles, CA”       5.1   5.6   10.4   53.3
 3    “ ” (Tokyo)             4.1   4.8    5.0   62.9
 4    “London”                3.6   3.3    4.9   48.8
 5    “São Paulo”             3.5   3.0    3.6   78.4
 6    “San Francisco”         2.8   2.7    4.1   41.2
 7    “New Jersey” [f]        2.5   2.8    2.1   20.0
 8    “Chicago”               2.2   2.0    1.7   32.0
 9    “Washington, DC”        2.1   2.8    2.6   34.3
10    “Manchester, UK”        1.9   2.0    1.1   30.8
11    “Atlanta”               1.7   2.1    2.1   46.2
12    “San Diego”             1.5   1.5    1.1   26.3
13    “Toronto, Canada”       1.3   1.1    1.5   42.9
14    “Seattle”               1.3   1.4    1.2   58.8
15    “Houston”               1.2   1.2    1.0   40.0
16    “Dallas, Texas”         1.2   1.0    1.4   61.5
17    “Rio de Janeiro”        1.2   1.0    1.1   30.8
18    “Boston, MA”            1.2   1.2    1.1   20.0
19    “Amsterdam”             1.1   1.1    0.9   50.0
20    “Jakarta, Indonesia”    1.1   0.6    0.3   42.9
21    “Austin, TX”            1.0   1.0    1.3   50.0
22    “Sydney”                0.9   1.0    0.8   38.5
23    “Orlando, Florida”      0.9   1.0    0.6   16.7
24    “Phoenix, AZ”           0.8   0.7    0.6   11.1
25    “ ” (Hyōgo) [g]         0.8   1.0    1.0   25.0
[a] Each cluster is labeled with the name most frequently used for locations assigned to the cluster.
[b] Out of the 2167 egos located with precision of <25,000 km².
[c] Out of the 1259 egos included in dyads with both parties located with precision of <25,000 km².
[d] Out of the 1259 alters included in dyads with both parties located with precision of <25,000 km².
[e] Defined as the share of local ties among all ties for egos in a cluster.
[f] Centered between Philadelphia and Trenton, NJ and includes all locations identified as just “New Jersey”.
[g] Centered near the boundary between Hyōgo and Osaka prefectures, in the Kansai region of Japan.

“… over half of the egos are in other countries, as are 4 of the 10 largest clusters: Tokyo, São Paulo, and two clusters in the United …”
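The "percentage of domestic ties" column of Table 5 is defined in its footnote as the number of ties with ego and alter in the same country, as a share of all ties for egos in that country. A minimal sketch of that calculation, assuming the ego-alter pairs are available as (ego country, alter country) tuples, is shown below; the function name and the sample pairs are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def domestic_share(pairs):
    """Percentage of ties staying within the ego's country, per ego country.

    pairs: iterable of (ego_country, alter_country) tuples.
    """
    total = defaultdict(int)
    domestic = defaultdict(int)
    for ego, alter in pairs:
        total[ego] += 1
        if ego == alter:
            domestic[ego] += 1
    return {country: 100.0 * domestic[country] / total[country] for country in total}


# Hypothetical sample of ego-alter pairs (not taken from the paper)
pairs = [("USA", "USA"), ("USA", "USA"), ("USA", "UK"), ("UK", "USA"), ("UK", "UK")]
shares = domestic_share(pairs)
```

The same function would give the "Locality" column of Table 3 if cluster labels were passed instead of country names.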
  • 43. www.bgoncalves.com@bgoncalves Population Heterogeneity Social Networks 34, 82 (2012) • Bernoulli process to generate adjacency matrix given a distance matrix between nodes
$$P(A = a \mid D) = \prod_{\{i,j\}} B\left(A_{ij} = a_{ij} \mid F(D_{ij}, \theta)\right)$$
 • Above some density threshold, the network is naturally connected.
[C.T. Butts et al. / Social Networks 34 (2012) 82–100] Fig. 3. Emergence of local connectivity on an uneven population density surface. Where the threshold population density for an approximately uniform region of …
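The Bernoulli process on this slide draws each tie independently, with a probability that depends only on the distance between the two nodes: $P(A = a \mid D) = \prod_{\{i,j\}} B(A_{ij} = a_{ij} \mid F(D_{ij}, \theta))$. The sketch below is one possible way to sample such a network; the function names, the step-function tie probability and the toy distance matrix are assumptions made for illustration, not the authors' code.

```python
import random


def sample_spatial_graph(dist, tie_prob, seed=None):
    """Sample a symmetric 0/1 adjacency matrix from a distance matrix.

    Each unordered pair {i, j} is an independent Bernoulli trial with
    success probability tie_prob(dist[i][j]).
    """
    rng = random.Random(seed)
    n = len(dist)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < tie_prob(dist[i][j]):
                adj[i][j] = adj[j][i] = 1
    return adj


# Hypothetical example: three nodes, with a step function standing in for F
# (ties are certain within 1 distance unit and impossible beyond it).
dist = [
    [0.0, 0.5, 3.0],
    [0.5, 0.0, 2.5],
    [3.0, 2.5, 0.0],
]
adj = sample_spatial_graph(dist, lambda d: 1.0 if d <= 1.0 else 0.0, seed=0)
```

Any function mapping distance to a probability in [0, 1] can be passed as `tie_prob`, so a fitted spatial interaction function can be swapped in without changing the sampler.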
  • 44. www.bgoncalves.com@bgoncalves Vertex Placement Social Networks 34, 82 (2012)
[C.T. Butts et al. / Social Networks 34 (2012) 82–100] Fig. 5. Comparison of uniform and quasi-random vertex placement, Quay County, NM MSA. Lines indicate census block boundaries, with artificial elevation shown via vertex color. Insets provide detail of 2 km × 2 km portion of Tucumcari, NM.
  • 45. www.bgoncalves.com@bgoncalves Friendship probability Social Networks 34, 82 (2012) • Probability that two people are friends as a function of distance:
$$F(d) = \frac{\theta_1}{(1 + \theta_2 d)^{\theta_3}}$$
 • with $(\theta_1, \theta_2, \theta_3)$ = (0.533, 0.032, 2.788) for “social friendships” and (0.859, 0.035, 6.437) for “face-to-face interactions”.
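To get a feel for how quickly the friendship probability decays, the function on this slide can be evaluated directly. The sketch below assumes the form $F(d) = \theta_1 / (1 + \theta_2 d)^{\theta_3}$ with the two parameter triples quoted above; the unit of d is not stated in the transcript, so the distances used here are in unspecified units, and the function and constant names are hypothetical.

```python
def friendship_probability(d, theta):
    """Tie probability at distance d: theta1 / (1 + theta2 * d) ** theta3."""
    theta1, theta2, theta3 = theta
    return theta1 / (1.0 + theta2 * d) ** theta3


# Parameter triples quoted on the slide
SOCIAL_FRIENDSHIP = (0.533, 0.032, 2.788)
FACE_TO_FACE = (0.859, 0.035, 6.437)

# Probabilities at a few distances (units of d are an assumption left open here)
distances = [0, 10, 100, 1000]
social = [friendship_probability(d, SOCIAL_FRIENDSHIP) for d in distances]
face_to_face = [friendship_probability(d, FACE_TO_FACE) for d in distances]
```

At d = 0 the probability equals θ1. The larger exponent of the face-to-face model makes its curve fall off much faster, so it starts above the friendship curve but drops below it as distance grows.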
  • 46. www.bgoncalves.com@bgoncalves Social Network Properties Social Networks 34, 82 (2012)
[C.T. Butts et al. / Social Networks 34 (2012) 82–100] Fig. 12. Marginal degree distributions by location, SIF, and placement model. Friendship model distributions are shown in blue, interaction model distributions in black; solid lines indicate uniform placement, with quasi-random placement in dotted lines.
  • 47. www.bgoncalves.com@bgoncalves Co-occurrences and Social Ties PNAS 107, 22436 (2010) • Geotagged Flickr Photos • Divide the world into a grid
 • Count the number of cells in which two individuals were both present within a given time interval
“… randomly selected Flickr users have a 0.0134% chance of having a social tie, but when two users have multiple spatio-temporal co-…”
Fig. 1. Illustration of how spatio-temporal co-occurrences are counted, for some sample time-stamped observations of individuals A and B. The world is divided into discrete cells of size s × s, and we count the number of cells k in which the two individuals have been observed within a time threshold of t days; in this case, k = 3 when t is 2.
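The counting procedure in the caption (grid cells of size s × s, time threshold of t days, count the distinct cells k in which both individuals were seen) can be sketched as below. This is an illustrative reading, not the paper's code: it assumes cells are obtained by flooring latitude and longitude divided by s, treats "within t days" as a difference of at most t days, and uses made-up observations and a hypothetical function name.

```python
import math
from datetime import date


def count_cooccurrence_cells(obs_a, obs_b, s, t_days):
    """Number of distinct s x s cells in which A and B were both observed
    at most t_days apart. obs_*: lists of (lat, lon, date)."""
    def cell(lat, lon):
        return (math.floor(lat / s), math.floor(lon / s))

    # Index A's observation dates by grid cell
    days_a = {}
    for lat, lon, day in obs_a:
        days_a.setdefault(cell(lat, lon), []).append(day)

    matched = set()
    for lat, lon, day in obs_b:
        c = cell(lat, lon)
        if any(abs((day - da).days) <= t_days for da in days_a.get(c, [])):
            matched.add(c)
    return len(matched)


# Hypothetical time-stamped observations of individuals A and B
obs_a = [(0.5, 0.5, date(2010, 1, 1)), (1.5, 0.5, date(2010, 1, 3)), (2.5, 2.5, date(2010, 1, 8))]
obs_b = [(0.6, 0.4, date(2010, 1, 1)), (1.2, 0.9, date(2010, 1, 2)), (2.1, 2.9, date(2010, 1, 1))]
k = count_cooccurrence_cells(obs_a, obs_b, s=1.0, t_days=2)
```

In this toy data the two individuals share three cells, but only two of the shared cells have observations at most two days apart; widening the threshold to seven days brings in the third.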
  • 48. www.bgoncalves.com@bgoncalves Co-occurrences and Social Ties PNAS 107, 22436 (2010)
[Figure, panels A–C for cell sizes s = 0.001°, s = 0.01° and s = 0.1°: probability of friendship versus number of contemporaneous events (0 to 20), on linear (0 to 1) and logarithmic (10^-3 to 10^0) vertical axes, with one curve per time threshold: 1 day, 7 days, 14 days, 28 days, 1 year.]
  • 49. www.bgoncalves.com@bgoncalves Human Mobility Nature 453, 779 (2008)
“… should increase with time as rg(t) ~ t^[…]; for an RW, rg(t) ~ t^(1/2); that is, the longer we … higher the chance that she/he will travel to areas … To check the validity of these predictions, we … dependence of the radius of gyration for users … radius would be considered small (rg(T) ≤ 3 km), … ≤ 30 km) or large (rg(T) > 100 km) at the end … period (T = 6 months). The results indicate that …”
“… is, F(x) ~ x^[…] for x < 1 and F(x) rapidly decreases for x ≫ 1. Therefore, the travel patterns of individual users may be approximated by a Lévy flight up to a distance characterized by rg. Most important, however, is the fact that the individual trajectories are bounded beyond rg; thus, large displacements, which are the source of the distinct and anomalous nature of Lévy flights, are statistically absent. To understand the relationship between the different exponents, we note that the measured probability distributions are related …”
[Figure caption: … mobility patterns. a, Week-long trajectory of 40 … indicates that most individuals travel only over short … regularly move over hundreds of kilometres. b, The … a single user. The different phone towers are shown as … for each location is shown as a vertical bar. The circle represents the radius of gyration centred in the trajectory’s centre of mass. c, Probability density function P(Δr) of travel distances obtained for the two studied data sets D1 and D2. The solid line indicates a truncated power law for which the …]
$$P(\Delta r) = (\Delta r + \Delta r_0)^{-\beta} \exp(-\Delta r / \kappa)$$
Cell Phones
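The radius of gyration used to classify users on this slide measures how far a trajectory typically strays from its centre of mass. A minimal sketch follows, assuming the usual definition (root mean square distance of the recorded positions from their centre of mass) and positions already projected to planar coordinates in km; the function name and the toy trajectory are hypothetical.

```python
import math


def radius_of_gyration(points):
    """Root mean square distance of the recorded positions from their centre of mass.

    points: list of (x, y) positions in km, one entry per recorded observation,
    so frequently visited locations weigh more.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)


# Hypothetical trajectory: four observations on the corners of a 2 km square
points = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
rg = radius_of_gyration(points)  # centre of mass is (1, 1)
```

For raw latitude and longitude data the positions would first need to be projected, or the Euclidean distance replaced by a great-circle distance.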
  • 50. www.bgoncalves.com@bgoncalves Human Mobility Nature 453, 779 (2008)
[Figure, panels a–d: a, density maps in x (km) and y (km) for three ranges of rg (rg ≤ 3 km, 20 km < rg < 30 km, rg > 100 km), colour scale from Φmin to Φmax; b, the same maps in rescaled coordinates x/σx and y/σy; c, σy/σx versus rg (km); d, cross-sections versus x/σx.]
Figure 3 | The shape of human trajectories. a, The probability density function Φ(x, y) of finding a mobile phone user in a location (x, y) in the user’s intrinsic reference frame (see Supplementary Information for details). The three plots, from left to right, were generated … 10,000 users with: rg ≤ 3, 20 < rg ≤ 30 and rg > 100 km. The trajectories become more anisotropic as rg increases. b, After scaling each position with σx and σy, the resulting $\tilde{\Phi}(x/\sigma_x, y/\sigma_y)$ has approximately the same shape for each group. c, The change in the shape of Φ(x, y) can be quantified calculating the isotropy ratio $S \equiv \sigma_y/\sigma_x$ as a function of rg, which decreases as $S \sim r_g^{-0.12}$ (solid line). Error bars represent standard error. d, $\tilde{\Phi}(x/\sigma_x, 0)$ representing the x-axis cross-section of the rescaled distribution $\tilde{\Phi}(x/\sigma_x, y/\sigma_y)$ shown in b.
Cell Phones
  • 51. www.bgoncalves.com@bgoncalves Privacy Sci Rep 3, 1376 (2013) function fits the data better than other two-parameters functions such as a 2 exp (lx), a stretched exponential a 2 exp xb , or a standard linear function a 2 bx (see Table S1). Both estimators for a and b are highly significant (p , 0.001)32 , and the mean pseudo-R2 is 0.98 for the Ip54 case and the Ip510 case. The fit is good at all levels of spatial and temporal aggregation [Fig. S3A–B]. The power-law dependency of e means that, on average, each time the spatial or temporal resolution of the traces is divided by two, their uniqueness decreases by a constant factor , (2)2b . This implies that privacy is increasingly hard to gain by lowering the resolution of a dataset. to larger populations, or geographies. An increase in population density will tend to decrease e. Yet, it will also be accompanied by an increase in the number of antennas, businesses or WiFi hotspots used for localizations. These effects run opposite to each other, and therefore, suggest that our results should generalize to higher popu- lation densities. Extensions of the geographical range of observation are also unlikely to affect the results as human mobility is known to be highly circumscribed. In fact, 94% of the individuals move within an average radius of less than 100 km17 . This implies that geographical exten- sions of the dataset will stay locally equivalent to our observations, Figure 2 | (A) Ip52 means that the information available to the attacker consist of two 7am-8am spatio-temporal points (I and II). In this case, the target was in zone I between 9am to 10am and in zone II between 12pm to 1pm. In this example, the traces of two anonymized users (red and green) are compatible with the constraints defined by Ip52. The subset S(Ip52) contains more than one trace and is therefore not unique. However, the green trace would be uniquely characterized if a third point, zone III between 3pm and 4pm, is added (Ip53). 
(B) The uniqueness of traces with respect to the number p of given spatio-temporal points (Ip). The green bars represent the fraction of unique traces, i.e. |S(Ip)| 5 1. The blue bars represent the fraction of |S(Ip)| # 2. Therefore knowing as few as four spatio-temporal points taken at random (Ip54) is enough to uniquely characterize 95% of the traces amongst 1.5 M users. (C) Box-plot of the minimum number of spatio-temporal points needed to uniquely characterize every trace on the non-aggregated database. At most eleven points are enough to uniquely characterize all considered traces. www.nature.com/scientificreports of spatial and temporal aggregation [Fig. S3A–B]. The power-law dependency of e means that, on average, each time the spatial or temporal resolution of the traces is divided by two, their uniqueness decreases by a constant factor , (2)2b . This implies that privacy is increasingly hard to gain by lowering the resolution of a dataset. Fig. 2B shows that, as expected, e increases with p. The mitigating effect of p on e is mediated by the exponent b which decays linearly with p: b 5 0.157 2 0.007p [Fig. 4E]. The dependence of b on p implies that a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution. In fact, given four points, a two-fold decrease in spatial or temporal resolu- tion makes it 9.3% less likely to identify an individual, while given ten points, the same two-fold decrease results in a reduction of only 6.2% (see Table S1). Because of the functional dependency of e on p through the expo- nent b, mobility datasets are likely to be re-identifiable using information on only a few outside locations. Discussion Our ability to generalize these results to other mobility datasets depends on the sensitivity of our analysis to extensions of the data lation densities. 
Extensions of the geographical range of observation are also unlikely to affect the results as human mobility is known to be highly circumscribed. In fact, 94% of the individuals move within an average radius of less than 100 km17 . This implies that geographical exten- sions of the dataset will stay locally equivalent to our observations, making the results robust to changes in geographical range. From an inference perspective, it is worth noticing that the spatio- temporal points do not equally increase the likelihood of uniquely identifying a trace. Furthermore, the information added by a point is highly dependent from the points already known. The amount of information gained by knowing one more point can be defined as the reduction of the cardinality of S(Ip) associated with this extra point. The larger the decrease, the more useful the piece of information is. Intuitively, a point on the MIT campus at 3AM is more likely to make a trace unique than a point in downtown Boston on a Friday evening. This study is likely to underestimate e, and therefore the ease of re- identification, as the spatio-temporal points are drawn at random from users’ mobility traces. Our Ip are thus subject to the user’s spatial and temporal distributions. Spatially, it has been shown that the uncertainty of a typical user’s whereabouts measured by its 10 6 10 5 10 4 10 3 10 0 10 1 10 2 10 3 Number of antennas Inhabitants Probabilitydensityfunction Median inter-interactions time per user [h] 0 12 24 36 48 60 72 84 96 10 0 10 -1 10 -2 10 -3 10 -4 10 0 10 -1 10 -2 10 -3 10 -4 10 -5 0 500 1000 1500 2000 2500 Number of interactions Probabilitydensityfunction A B C Figure 3 | (A) Probability density function of the amount of recorded spatio-temporal points per user during a month. (B) Probability density function of the median inter-interaction time with the service. (C) The number of antennas per region is correlated with its population (R2 5 .6426). 
These plots strongly emphasize the discrete character of our dataset and its similarities with datasets such as those collected by smartphone apps. Cell Phones
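As a rough numerical check of the scaling quoted in the Sci Rep excerpt above, the following Python sketch evaluates the linear fit β = 0.157 − 0.007p and the corresponding loss 1 − 2^(−β) for a two-fold coarsening of the resolution. The function names are illustrative and do not come from the paper. The fitted line gives losses slightly below the 9.3% and 6.2% quoted from Table S1, which presumably reflect the measured data rather than the fit.

```python
def beta(p: int) -> float:
    """Exponent of the power-law decay, using the linear fit quoted in the excerpt."""
    return 0.157 - 0.007 * p


def uniqueness_loss_on_halving(p: int) -> float:
    """Fractional drop in uniqueness when the resolution is coarsened two-fold.

    Uniqueness is multiplied by 2**(-beta), so the relative loss is 1 - 2**(-beta).
    """
    return 1.0 - 2.0 ** (-beta(p))


for p in (4, 10):
    print(f"p = {p:2d}: beta = {beta(p):.3f}, loss = {uniqueness_loss_on_halving(p):.1%}")
```

The loss shrinks as p grows, which is the point made in the excerpt: with more known points, coarsening the data buys less privacy.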
  • 52. Privacy Sci Rep 3, 1376 (2013) [Figure, panels A to E. Axes: temporal resolution [h], spatial resolution [v], uniqueness of traces, normalized uniqueness of traces. Curves for spatial resolutions of 1 to 13 cells and temporal resolutions of 1 to 13 hours. Panel E shows the exponent β.] Cell Phones
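The notion of a unique trace, and the effect of lowering the resolution, can be illustrated with a toy computation. The sketch below assumes a trace is a set of (cell, hour) points, draws p points at random from each trace, and counts the trace as unique when no other user's trace contains all of them, i.e. |S(I_p)| = 1. The data, the integer-division coarsening and all names are invented for illustration; this is not the procedure or the dataset of Sci Rep 3, 1376 (2013).

```python
import random


def coarsen(trace, cell_factor, hour_factor):
    """Lower the resolution by merging neighbouring cells and hours."""
    return frozenset((cell // cell_factor, hour // hour_factor) for cell, hour in trace)


def matching_users(traces, points):
    """S(I_p): the users whose trace contains every known point."""
    return {user for user, trace in traces.items() if points <= trace}


def uniqueness(traces, p, rng):
    """Fraction of users singled out by p points drawn at random from their own trace."""
    unique = 0
    for user, trace in traces.items():
        known = frozenset(rng.sample(sorted(trace), min(p, len(trace))))
        if matching_users(traces, known) == {user}:
            unique += 1
    return unique / len(traces)


# Toy traces as sets of (cell id, hour of day); invented for illustration.
fine = {
    "a": frozenset({(0, 8), (5, 12), (9, 20)}),
    "b": frozenset({(0, 8), (4, 12), (9, 21)}),
    "c": frozenset({(1, 9), (5, 13), (2, 20)}),
}
# Merge cells in pairs and hours in two-hour windows.
coarse = {user: coarsen(trace, 2, 2) for user, trace in fine.items()}

rng = random.Random(0)
print("fine resolution:  ", uniqueness(fine, 2, rng))
print("coarse resolution:", uniqueness(coarse, 2, rng))
```

In this toy data every trace is unique at full resolution, while after coarsening users "a" and "b" become indistinguishable, so the fraction of unique traces drops.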
  • 53. Gravity Law of Commuting PNAS 106, 21484 (2009) US county commuting network. Each node i: a subpopulation (census area). Each link (i, j): an interaction between subpopulations i and j. Weight $w_{ij}$: the number of people commuting from i to j per unit time.
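A commuting network of this kind is a weighted directed graph, and a gravity law predicts each weight from the two populations and the distance between them. The sketch below is a minimal illustration under stated assumptions: it uses a generic gravity-type form with exponential distance decay, and the populations, distances, exponents and scale are hypothetical placeholders, not the functional form or fitted values reported in PNAS 106, 21484 (2009).

```python
import math

# Hypothetical subpopulations and centroid distances in km (not real data).
population = {"A": 50_000, "B": 200_000, "C": 10_000}
distance_km = {("A", "B"): 30.0, ("A", "C"): 80.0, ("B", "C"): 60.0}


def dist(i, j):
    """Symmetric lookup of the distance between subpopulations i and j."""
    return distance_km[(i, j)] if (i, j) in distance_km else distance_km[(j, i)]


def gravity_flow(i, j, scale=1e-6, alpha=1.0, gamma=1.0, decay_km=50.0):
    """Assumed form: w_ij = scale * N_i**alpha * N_j**gamma * exp(-d_ij / decay_km)."""
    return (scale * population[i] ** alpha * population[j] ** gamma
            * math.exp(-dist(i, j) / decay_km))


# Weighted directed network: one weight w_ij per ordered pair (i, j).
weights = {(i, j): gravity_flow(i, j)
           for i in population for j in population if i != j}

# Total number of commuters leaving each subpopulation per unit time.
out_commuters = {i: sum(w for (src, _), w in weights.items() if src == i)
                 for i in population}

for (i, j), w in sorted(weights.items()):
    print(f"w[{i}->{j}] = {w:.1f}")
```

With equal exponents for origin and destination the predicted weights are symmetric; choosing different values for `alpha` and `gamma` would make the flows direction-dependent.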
  • 54. Gravity Law of Commuting PNAS 106, 21484 (2009) [Figure, panels A to F. Axes: $w^{(D)}/w^{(M)}$ and $w^{(D)}/(NN)$ versus distance (km), population of origin, and population of destination.] D. Balcan et al. / Journal of Computational Science 1 (2010) 132–145
    Table 1. Commuting networks in each continent. Number of countries (N), number of administrative units (V) and inter-links between them (E) are summarized.
      Continent        N     V         E
      Europe           17    65,880    4,490,650
      North America    2     6,986     182,255
      Latin America    5     4,301     102,117
      Asia             4     4,355     380,385
      Oceania          2     746       30,679
      Total            30    82,268    5,186,186
    [...] commuting. This allows us to deal with units that are self-similar across the world with respect to mobility, as they emerge from the tessellation, rather than with country-specific administrative boundaries. We have therefore mapped the different levels of commuting data into the geographical census areas formed by the Voronoi-like tessellation procedure described above. The mapped commuting flows can be seen as a second transport network connecting subpopulations that are geographically close.
    This second network can be overlaid on the WAN in a multi-scale fashion to simulate realistic scenarios for disease spreading. The network exhibits important variability in the number of commuters on each connection as well as in the total number of commuters per geographical census area. Since the census areas are statistically homogeneous, we can also extract a general statistical law that allows for the synthetic generation of commuting networks in countries where real data are not available. A full account of the commuting data obtained across different continents and their statistical analysis can be found in Ref. [2].
    3.3. Disease model
    Table 2. Transitions between compartments and their rates.
      Transition               Type           Rate
      $S_j \to L_j$            Contagion      $\lambda_j$
      $L_j \to I^a_j$          Spontaneous    $\varepsilon p_a$
      $L_j \to I^t_j$          Spontaneous    $\varepsilon (1 - p_a) p_t$
      $L_j \to I^{nt}_j$       Spontaneous    $\varepsilon (1 - p_a)(1 - p_t)$
      $I^a_j \to R_j$          Spontaneous    $\mu$
      $I^t_j \to R_j$          Spontaneous    $\mu$
      $I^{nt}_j \to R_j$       Spontaneous    $\mu$
    In general, the force of infection is assumed to follow the mass action principle, for which the infection rate is $\lambda = \beta I / N$, where $\beta$ is the infection transmission rate and $I / N$ is the density of infected individuals in the population. In the case of asymptomatic individuals the force of infection is usually reduced by a factor $r_\beta$. In the case of multiple interacting subpopulations and different classes of infectives, the force of infection will be the sum of different contributions, as reported in Section 4.3. Given the force of infection $\lambda_j$ in subpopulation j, each person in the susceptible compartment ($S_j$) contracts the infection with probability $\lambda_j \Delta t$ and enters the latent compartment ($L_j$), where $\Delta t$ is the time interval considered. Latent individuals exit the compartment with probability $\varepsilon \Delta t$, and transit to the asymptomatic infectious compartment ($I^a_j$) with probability $p_a$ or, with the complementary probability $1 - p_a$, become symptomatic infectious. Infectious persons with symptoms are further divided between those who can travel ($I^t_j$), with probability $p_t$, and those who are travel-restricted ($I^{nt}_j$), with probability $1 - p_t$. All the infectious persons permanently recover with probability $\mu \Delta t$, entering the recovered compartment [...]
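The compartment transitions described in the excerpt can be sketched as one stochastic time step for a single, isolated subpopulation. This is a minimal illustration, not the simulator used in the paper: the per-individual Bernoulli draws, the single-population force of infection (the multi-population expression of Section 4.3 is not reproduced here), and all parameter values and names are assumptions made for the example.

```python
import random


def force_of_infection(counts, beta, r_beta):
    """Mass-action rate beta * I / N, with asymptomatic carriers weighted by r_beta."""
    n = sum(counts.values())
    infectious = counts["It"] + counts["Int"] + r_beta * counts["Ia"]
    return beta * infectious / n


def transition_probabilities(lam, eps, mu, p_a, p_t, dt):
    """Per-individual probabilities of each transition during one step of length dt."""
    return {
        ("S", "L"): lam * dt,
        ("L", "Ia"): eps * dt * p_a,
        ("L", "It"): eps * dt * (1 - p_a) * p_t,
        ("L", "Int"): eps * dt * (1 - p_a) * (1 - p_t),
        ("Ia", "R"): mu * dt,
        ("It", "R"): mu * dt,
        ("Int", "R"): mu * dt,
    }


def step(counts, probs, rng):
    """Advance every individual by one time step and return the new counts."""
    new = dict(counts)
    for _ in range(counts["S"]):
        if rng.random() < probs[("S", "L")]:
            new["S"] -= 1
            new["L"] += 1
    for _ in range(counts["L"]):
        # A latent individual leaves to at most one of the three infectious classes.
        u, cumulative = rng.random(), 0.0
        for target in ("Ia", "It", "Int"):
            cumulative += probs[("L", target)]
            if u < cumulative:
                new["L"] -= 1
                new[target] += 1
                break
    for source in ("Ia", "It", "Int"):
        for _ in range(counts[source]):
            if rng.random() < probs[(source, "R")]:
                new[source] -= 1
                new["R"] += 1
    return new


# Hypothetical initial condition and parameters, chosen only for illustration.
counts = {"S": 990, "L": 0, "Ia": 0, "It": 10, "Int": 0, "R": 0}
lam = force_of_infection(counts, beta=0.5, r_beta=0.5)
probs = transition_probabilities(lam, eps=0.5, mu=0.3, p_a=0.33, p_t=0.5, dt=1.0)
after = step(counts, probs, random.Random(42))
print(after)
```

Drawing each individual separately keeps the sketch dependency-free; a production code would more likely use binomial and multinomial draws per compartment.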
  • 55. Mobility and Social Networks Coupling Mobility and Interactions in Social Media [...] and for their dependence on the distance. The error Err of this null model is between 0.66 and 0.76 for the three countries, around twice the error of the TF model (see Figure 6). The linking model (L model) is a simplified version of the TF model, without random mobility and with the box size taken to zero. Agents move to visit their contacts with probability $p_v$, whereas with probability $1 - p_v$ they do not perform any action. In this version of the model, users can connect only by random connections or when two of them coincide while visiting a common friend, which leads to triadic closure. These two processes do not depend on the distances between the users. A thorough description can be obtained with a mean-field approach (see the corresponding section). The results of the L model are shown in Figure 2. Due to the triangle closing mechanism, this null model creates networks with a considerable level of clustering. However, it does not [...] (e.g., for the US the TF model has Err lower by 0.5 and 1.5 than the TF-normal and the TF-uniform models, respectively, as shown in Figure 6). Simplified models that neglect either geography or network structure perform considerably worse than the TF model in reproducing the properties of real networks. Likewise, non-realistic assumptions on the human mobility mechanism yield worse results than the default TF model. To conclude, the coupling of geography and structure through a realistic mobility mechanism produces networks with significantly more realistic geographic and structural properties. Sensitivity of the TF Model to the Parameters and its Modifications. The results presented so far have been obtained at the optimal [...] Figure 4. Simulation results: mobility and social networks. Mobility (upper row) and ego networks (lower row) of 20 random users (different colors) for the instances of the TF model yielding the lowest error Err (see Figure 3). The mobility network shows the mobility patterns of individual users throughout the entire simulation. The ego network shows the social connections at the end of the simulation. doi:10.1371/journal.pone.0092196.g004 PLoS One 9, e92196 (2014)
  • 56. Geo-Social Properties [Figure panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity.] Figure 2. Network geo-social properties. Various statistical network properties are plotted for the data obtained from Twitter (red squares), Gowalla (blue diamonds), Brightkite (green triangles) and the null models (dashed lines), for the US (for the UK and Germany, see Figures S1 and S2). The spatial model (magenta), based on geography, matches well the data in $P_l(d)$, but yields near-zero values for $R(d)$, $J_f(d)$ and $C(d)$. The linking model (cyan), based on triadic closure, produces enough clustering, but it does not reproduce the distance dependencies of $P_l(d)$, $R(d)$, $J_f(d)$ and $C(d)$. doi:10.1371/journal.pone.0092196.g002 [...] that has also an edge between i and k, forming a triangle. Note that a triangle consists of 3 triads centered on different nodes. The effect of the distance on the clustering coefficient can be incorporated by measuring the distances from each central node j to two neighbors i and k forming a triad, $d = d_{ij} + d_{jk}$, and calculating the network clustering restricted to triads with distance d. This new function $C(d)$ is the probability of closing a triangle given the distance d in a triad, $C(d) = \Delta(d) / \Lambda(d)$, where $\Lambda(d)$ and $\Delta(d)$ are the numbers of triads and closed triads for the distance d, respectively. The value of the global clustering coefficient C can be recovered by averaging $C(d)$ over d. In the datasets, we observe a drop in $C(d)$ followed by a plateau, which is best visible for the US networks (Figure 2E). Given a triangle, several configurations are possible if there is diversity in the edge lengths. The triangle can be equilateral if all the edges have the same length, isosceles if two have the same length and the other is smaller, etc. We estimate the dominant shapes of the triangles in the network by measuring the disparity D, defined as $D = 6 \left( \frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3} \right)$, where $d_1$, $d_2$ and $d_3$ are the geographical distances between the locations of the users forming the triangle. The disparity takes values between 0 and 1 as the shape of the triangle passes from equilateral to isosceles, where one edge is much smaller than the other two. D shows a distribution with two maxima in the online social networks (Figure 2F), for low and high values. The two [...] PLoS One 9, e92196 (2014)
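Both quantities defined in the excerpt are easy to compute once the triads and triangles of a network have been enumerated. The sketch below is a minimal illustration: the input format (a list of triad distances with a closed/open flag), the fixed-width distance bins and the function names are assumptions made for the example, not the authors' code.

```python
from collections import defaultdict


def triangle_disparity(d1, d2, d3):
    """D = 6 * ((d1^2 + d2^2 + d3^2) / (d1 + d2 + d3)^2 - 1/3)."""
    total = d1 + d2 + d3
    return 6.0 * ((d1 ** 2 + d2 ** 2 + d3 ** 2) / total ** 2 - 1.0 / 3.0)


def clustering_by_distance(triads, bin_width):
    """C(d): closed triads divided by all triads, per distance bin.

    `triads` is an iterable of (d, is_closed) pairs, where d = d_ij + d_jk.
    The result maps the lower edge of each bin to the fraction of closed triads.
    """
    totals = defaultdict(int)
    closed = defaultdict(int)
    for d, is_closed in triads:
        b = int(d // bin_width)
        totals[b] += 1
        closed[b] += bool(is_closed)
    return {b * bin_width: closed[b] / totals[b] for b in sorted(totals)}


# Limiting shapes: equilateral, and isosceles with one vanishing edge.
print(triangle_disparity(1.0, 1.0, 1.0))
print(triangle_disparity(1.0, 1.0, 0.0))

# Hypothetical triads: (distance in km, closed?).
toy_triads = [(10, True), (12, False), (55, False), (58, False)]
print(clustering_by_distance(toy_triads, bin_width=50))
```

The two printed disparities sit at the ends of the range stated in the text, close to 0 for the equilateral triangle and close to 1 when one edge shrinks to nothing.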
  • 57. Geo-Social Model [Diagram labels: new position of u; detect all encounters e in the box of u; visit a random neighbour; jump to a new location; starting position of user u; created new social links.] PLoS One 9, e92196 (2014)
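The diagram labels above suggest a simple update loop, which can be sketched as follows. This is a hedged reading of the slide, not the published TF model: the order of the steps, the undirected links, the uniform `random_jump` displacement, the square boxes and every name here are assumptions made for illustration. The parameters `p_v` and `p_c` stand for the probabilities of visiting an existing friend and of befriending an encountered user.

```python
import random


def box_of(position, box_size):
    """Index of the square box that contains a position."""
    x, y = position
    return (int(x // box_size), int(y // box_size))


def random_jump(position, rng, max_step=5.0):
    """Hypothetical jump rule: a uniform displacement in each coordinate."""
    x, y = position
    return (x + rng.uniform(-max_step, max_step), y + rng.uniform(-max_step, max_step))


def tf_step(user, positions, friends, p_v, p_c, box_size, rng, jump=random_jump):
    """One update of `user`: move, detect encounters, possibly create links."""
    # 1. Move: visit a random neighbour with probability p_v, otherwise jump.
    if friends[user] and rng.random() < p_v:
        target = rng.choice(sorted(friends[user]))
        positions[user] = positions[target]
    else:
        positions[user] = jump(positions[user], rng)
    # 2. Detect all encounters in the box of the user's new position.
    here = box_of(positions[user], box_size)
    encounters = [v for v in positions
                  if v != user and box_of(positions[v], box_size) == here]
    # 3. Create a new (undirected) link with each encounter with probability p_c.
    for v in encounters:
        if v not in friends[user] and rng.random() < p_c:
            friends[user].add(v)
            friends[v].add(user)
    return encounters


# Hypothetical three-user example.
positions = {"u": (0.0, 0.0), "f": (10.2, 10.7), "w": (10.9, 10.1)}
friends = {"u": {"f"}, "f": {"u"}, "w": set()}
met = tf_step("u", positions, friends, p_v=1.0, p_c=1.0, box_size=1.0,
              rng=random.Random(0))
print(met, friends["u"])
```

In the example, u visits its friend f, meets w in the same box, and links to w. This is the kind of encounter through a common friend that produces triadic closure.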
  • 58. Model Fitting [...] 0.39 for Germany. For simplicity, we focus on the Twitter networks only, although similar results are obtained for the other datasets. Results. Simulations for the Optimal Parameters. An example with the displacements between the consecutive locations and the ego networks for a sample of individuals, as generated by the TF model, is displayed in Figure 4. The parameters of the model are set to the ones that correspond to the minimum of the error Err. As shown, the agents tend to stay close to their original positions. Occasional long jumps occur due to visits to friends who live far away. In this range of parameters and simulation times, the main mechanism for generating long distance [...] second null model, the linking model (L model), in contrast, is based only on random linking and triadic closure, and it is equivalent to the TF model without the mobility. We consider the two uncoupled null models and compare their results with those of the TF model. In this way, we demonstrate the importance of the coupling through a realistic mobility mechanism to reproduce the empirical networks. The spatial model (S model) consists of randomly connecting pairs of users with a probability that decays as a power law of the distance between them (suggested in [41]). The exponent of the power law is fixed at −0.7 following Figure 2A. The results of the S model are shown in the panels of Figure 2. While it is set to match $P_l(d)$, other properties such as $P(k)$, $R(d)$, $J_f(d)$, $C(d)$ or $P(D)$ are not well reproduced. The S model fails to account for the high level of clustering and reciprocity in the empirical networks [...] Figure 3. Fitting the TF model. Values of the error Err when $p_v$ and $p_c$ are changed. The minimum error for each of the plots is marked with a red rectangle. doi:10.1371/journal.pone.0092196.g003 PLOS ONE | www.plosone.org | March 2014 | Volume 9 | Issue 3 | e92196 [Figure axes: Prob. to Make a New Friend; Prob. to Visit an Old Friend.] PLoS One 9, e92196 (2014)
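The fitting shown in Figure 3 amounts to scanning the two probabilities and keeping the pair with the smallest error. A minimal sketch of such a scan is given below. The excerpt does not define Err, so `toy_error` is a made-up stand-in, and the grid values and function names are likewise hypothetical; in practice the error would come from running the model and comparing its network statistics with the data.

```python
from itertools import product


def fit_by_grid_search(error_fn, pv_values, pc_values):
    """Return the (p_v, p_c) pair on the grid that minimizes error_fn."""
    return min(product(pv_values, pc_values), key=lambda pair: error_fn(*pair))


def toy_error(p_v, p_c):
    """Stand-in for Err: a smooth bowl with its minimum at (0.3, 0.1)."""
    return (p_v - 0.3) ** 2 + (p_c - 0.1) ** 2


best_pv, best_pc = fit_by_grid_search(
    toy_error,
    pv_values=[0.1, 0.3, 0.5],
    pc_values=[0.05, 0.1, 0.2],
)
print(f"best p_v = {best_pv}, best p_c = {best_pc}")
```

An exhaustive scan is affordable here because there are only two parameters; each grid point costs one full set of simulations.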
  • 59. Model Results [Figure panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity.] Figure 5. Geo-social properties of the model networks. Various statistical properties are plotted for the networks obtained from Twitter data (red squares) and from simulation of the TF model (black line) for the US. Corresponding results for the UK and Germany can be found in Figures S3 and S4. doi:10.1371/journal.pone.0092196.g005 [...] random connections, and so the distribution of triangle disparity prevents the model from producing networks with characteristics [...] PLoS One 9, e92196 (2014)
  • 60. Human Diffusion J. R. Soc. Interface 12, 20150473 (2015)
  • 61. Human Diffusion J. R. Soc. Interface 12, 20150473 (2015) [Figure panels: (a) Starting from Paris; (b) Starting from New York.]
  • 62. Human Diffusion J. R. Soc. Interface 12, 20150473 (2015) [Figure panel: (b) Starting from New York.]
  • 63. Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015)
  • 64. Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015) [Figure, panels a to d. (a) Axes: Coverage and R̃; series: Local, Non-Local. (b) Axes: Coverage and Proportion of Non-Local Users. (c) Coverage by city: New York, Chicago, San Francisco, Shanghai, Dallas, Berlin, Paris, Saint Petersburg, Beijing, Moscow. (d) Coverage by city: Houston, Barcelona, Brussels, Detroit, Lima, Istanbul, Rome, Moscow, Paris, Lisbon.]
  • 65. City Communities J. R. Soc. Interface 12, 20150473 (2015) [Figure: weighted betweenness (× 10²) and weighted degree for Los Angeles, San Francisco, Miami, Singapore, Tokyo, Paris, London, New York.]
  • 66. Angkor Wat, Forbidden City, Corcovado, Eiffel Tower, Giza, Golden Pavilion, Grand Canyon, Hagia Sophia, Iguazu Falls, Kukulcan, London Tower, Machu Picchu, Mount Fuji, Niagara Falls, Taj Mahal, Pisa Tower, Times Square, Zocalo, Saint Basil's Cathedral, Alhambra
  • 67. Angkor Wat, Forbidden City, Corcovado, Eiffel Tower, Giza, Golden Pavilion, Grand Canyon, Hagia Sophia, Iguazu Falls, Kukulcan, London Tower, Machu Picchu, Mount Fuji, Niagara Falls, Taj Mahal, Pisa Tower, Times Square, Zocalo, Saint Basil's Cathedral, Alhambra
  • 69. Touristic Sites [Figure, panels (a) to (c). (a) Radius: Times Square, Niagara Falls, Angkor Wat, Grand Canyon, Machu Picchu, Giza, Forbidden City, Eiffel Tower, Pisa Tower, Taj Mahal. (b) Coverage (cell): Iguazu Falls, Giza, Times Square, Machu Picchu, Forbidden City, Niagara Falls, Eiffel Tower, Taj Mahal, Grand Canyon, Pisa Tower. (c) Coverage (country): London Tower, Times Square, Hagia Sophia, Machu Picchu, Angkor Wat, Forbidden City, Pisa Tower, Eiffel Tower, Giza, Taj Mahal.] EPJ Data Science 5, 12 (2016)
  • 71. Discussion
      • Online social networks generate unprecedented amounts of data on human behavior
      • The widespread adoption of GPS-enabled devices allows us to observe geographical variations
      • Human mobility is an intrinsically multi-scale process
      • Twitter is a good source of geolocated data, but it has many biases that must be considered
      • Different types of links serve different social and information diffusion functions
      • The strength of ties provides important clues to the social structure
      • Colocation increases the likelihood of friendship
      • Mobility and social structure mutually influence each other
      • Mobility is a proxy for the centrality of a city or touristic locale
  • 72. References
      PLoS One 10, e0115545 (2015)
      Sci Rep 3, 1376 (2013)
      Social Networks 34, 73 (2012)
      Social Networks 34, 82 (2012)
      PLoS One 7, e29358 (2012)
      ICWSM'11, 375 (2011)
      PNAS 107, 22436 (2010)
      Nature 453, 779 (2008)
      PNAS 104, 7333 (2007)
      EPJ Data Science 5, 12 (2016)
      J. R. Soc. Interface 12, 20150473 (2015)
      PLoS One 9, e92196 (2014)
      PLoS One 8, e61981 (2013)
      ICWSM'11, 89 (2011)
      PNAS 106, 21484 (2009)