Twitterology - The Science of Twitter

An overview of several recent results obtained using Twitter data.

Published in: Science

  1. 1. Bruno Gonçalves www.bgoncalves.com Twitterology: The Science of Twitter
  2. 2. www.bgoncalves.com@bgoncalves The Internet In Real Time
  3. 3. www.bgoncalves.com@bgoncalves
  4. 4. Social Media
  5. 5. Twitter
  6. 6. Data
  7. 7. Anatomy of a Tweet
  8. 8. Anatomy of a Tweet
  9. 9. Anatomy of a Tweet
     Tweet keys: [u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'favorite_count', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'lang', u'created_at', u'in_reply_to_status_id_str', u'place', u'metadata']
     Keys of the nested u'user' object: [u'follow_request_sent', u'profile_use_background_image', u'default_profile_image', u'id', u'profile_background_image_url_https', u'verified', u'profile_text_color', u'profile_image_url_https', u'profile_sidebar_fill_color', u'entities', u'followers_count', u'profile_sidebar_border_color', u'id_str', u'profile_background_color', u'listed_count', u'is_translation_enabled', u'utc_offset', u'statuses_count', u'description', u'friends_count', u'location', u'profile_link_color', u'profile_image_url', u'following', u'geo_enabled', u'profile_banner_url', u'profile_background_image_url', u'screen_name', u'lang', u'profile_background_tile', u'favourites_count', u'name', u'notifications', u'url', u'created_at', u'contributors_enabled', u'time_zone', u'protected', u'default_profile', u'is_translator']
     http://www.bgoncalves.com/teaching/data-mining.html
  10. 10. Anatomy of a Tweet
     Tweet keys: [u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'favorite_count', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'lang', u'created_at', u'in_reply_to_status_id_str', u'place', u'metadata']
     Keys of a point object such as u'coordinates': [u'type', u'coordinates']
     Keys of u'entities': [u'symbols', u'user_mentions', u'hashtags', u'urls']
     u'source': u'<a href="http://foursquare.com" rel="nofollow">foursquare</a>'
     u'text': u"I'm at Terminal Rodovi\xe1rio de Feira de Santana (Feira de Santana, BA) http://t.co/WirvdHwYMq"
     Entry in the u'urls' list: {u'display_url': u'4sq.com/1k5MeYF', u'expanded_url': u'http://4sq.com/1k5MeYF', u'indices': [70, 92], u'url': u'http://t.co/WirvdHwYMq'}
     http://www.bgoncalves.com/teaching/data-mining.html
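A minimal Python 3 sketch of how the fields on this slide fit together. The `text`, `source` and `urls` values are copied from the slide; the empty `symbols`, `user_mentions` and `hashtags` lists and the helper names `extract_urls` and `source_app` are assumptions made for illustration, not part of the original deck.

```python
import re

# Sample values copied from the slide; a real tweet carries many more keys.
# The empty entity lists are an assumption of this sketch.
tweet = {
    "text": ("I'm at Terminal Rodovi\u00e1rio de Feira de Santana "
             "(Feira de Santana, BA) http://t.co/WirvdHwYMq"),
    "source": '<a href="http://foursquare.com" rel="nofollow">foursquare</a>',
    "entities": {
        "symbols": [],
        "user_mentions": [],
        "hashtags": [],
        "urls": [{
            "display_url": "4sq.com/1k5MeYF",
            "expanded_url": "http://4sq.com/1k5MeYF",
            "indices": [70, 92],
            "url": "http://t.co/WirvdHwYMq",
        }],
    },
}


def extract_urls(tweet):
    """Return (short URL, expanded URL, slice of the text) for each URL entity."""
    results = []
    for entity in tweet["entities"]["urls"]:
        start, end = entity["indices"]  # character offsets into tweet["text"]
        results.append(
            (entity["url"], entity["expanded_url"], tweet["text"][start:end])
        )
    return results


def source_app(tweet):
    """Strip the HTML anchor from the 'source' field, leaving the client name."""
    return re.sub(r"<[^>]+>", "", tweet["source"]).strip()
```

The `indices` pair locates the shortened link inside `text`, so slicing the text with it should give back the `url` value.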
  11. 11. Demographics
  12. 12. Market Penetration PLoS One 8, e61981 (2013)
  13. 13. World Coverage
  14. 14. Age Distribution PLoS One 10, e0115545 (2015)
  15. 15. Demographics
     Excerpts from ICWSM’11, 375 (2011); the page is cropped on the slide, and [...] marks text that is cut off:
     • Gender: “[...] users who we could infer a gender for, based on their name and the list previously described. We do so by comparing the first word of their self-reported name to the gender list. We observe that there exists a match for 64.2% of the users. Moreover, we find a strong bias towards male users: Fully 71.8% of the users who we find a name match for had a male name.”
     • “[...] less than 95% predictive (e.g., the name Avery was observed to correspond to male babies only 56.8% of the time; it was [...]”
     • Figure 3: Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time. [Plot: fraction of joining users who are male, 0 to 1, against date, 2007-01 to 2009-07.]
     • Race/ethnicity list: “[...] each last name with over 100 individuals in the U.S. [...] the 2000 Census, the Census releases the distribution of race/ethnicity for that last name. For example, the last name “Myers” was observed to correspond to Caucasians 86% of the time, African-Americans 9.7%, Asians 0.4%, and Hispanics 1.4%.”
     • Race/ethnicity distribution of Twitter users: “We first determined the number of U.S.-based users [for] whom we could infer the race/ethnicity by comparing [the] last word of their self-reported name to the U.S. Census last name list. We observed that we found a match [for] 71.8% of the users. We then determined the distribution of race/ethnicity in each county by taking the race/ethnicity distribution in the Census list, weighted by the frequency of each name occurring in Twitter users in that county. Due to the large amount of ambiguity in the last name race/ethnicity list (in particular, the last name list is more than 95% predictive for only 18.5% of the users), [...]”
     • Footnote 1: “This is effectively the census.model approach discussed in prior work (Chang et al. 2010).”
     • Figure 2: Per-county over- and underrepresentation of U.S. [...] rate of 0.324%, presented in both (a) a normal layout and [...]. Blue colors indicate underrepresentation, while red colors [...] to the log of the over- or underrepresentation rate. Clear trend [...] overrepresentation of populous counties.
     • Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of (a) Caucasian (non-hispanic), (b) African-American, (c) Asian or Pacific Islander, and (d) Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling.
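The name-matching step quoted above can be sketched as follows. This is only an illustration: the gender list here is a hypothetical two-entry dictionary, the function names are invented, and lower-casing the first word is an assumption not stated in the excerpt.

```python
def infer_gender(name, gender_list):
    """Look up the first word of a self-reported name in a gender list.

    gender_list maps lower-cased first names to "male" or "female"
    (hypothetical format). Returns None when there is no match.
    """
    words = name.split()
    if not words:
        return None
    return gender_list.get(words[0].lower())


def gender_summary(names, gender_list):
    """Return (share of names matched, share of matched names that are male)."""
    inferred = [infer_gender(name, gender_list) for name in names]
    matched = [gender for gender in inferred if gender is not None]
    match_rate = len(matched) / len(names) if names else 0.0
    male_share = matched.count("male") / len(matched) if matched else 0.0
    return match_rate, male_share


# Toy gender list, for illustration only.
toy_gender_list = {"bruno": "male", "maria": "female"}
toy_names = ["Bruno G", "Maria S", "xX_cool_Xx", "bruno"]
match_rate, male_share = gender_summary(toy_names, toy_gender_list)
```

On the toy input three of four names match and two of those three are male; the excerpt reports the corresponding quantities as 64.2% and 71.8% for its own data.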
  16. 16. Network Structure
  17. 17. Twitter Network
     Excerpts from WWW'10, 591 (2010); the page is cropped on the slide, and [...] marks text that is cut off:
     • “[...] The top 20 are listed in Figure 7. Some of them follow the[ir] followers, but most of them do not (the median number of followings of the top 40 users is 114, three orders of magnitude smaller than the number of followers). We revisit the issue of reciprocity [in] Section 3.3.”
     • Basic Analysis. Figure 1: Number of followings and followers. “[...] construct a directed network based on the following and fol[...] and analyze its basic characteristics. Figure 1 displays the [distri]bution of the number of followings as the solid line and that of [follo]wers as the dotted line. The y-axis represents complementary [cumu]lative distribution function (CCDF). [...]”
     • 3.2 Followers vs. Tweets. Figure 2: The number of followers and that of tweets per user. “In order to gauge the correlation between the number of followers and that of written tweets, we plot the number of tweets (y) against the number of followers a user has (x) in Figure 2. We bin the number of followers in logscale and plot the median per bin in the dashed line. The majority of users who have fewer than 10 followers never tweeted or did just once and thus the median stays at 1. [...] state the correlation between the numbers of tweets and followers.”
     • 3.3 Reciprocity. “In Section 3.1 we briefly mention that top users by the number of followers in Twitter are mostly celebrities and mass media and most of them do not follow their followers back. In fact Twitter shows a low level of reciprocity; 77.9% of user pairs with any link between them are connected one-way, and only 22.1% have reciprocal relationship between them. We call those r-friends of a user as they reciprocate a user’s following. Previous studies have reported much higher reciprocity on other social networking services: 68% on Flickr [4] and 84% on Yahoo! 360 [18]. Moreover, 67.6% of users are not followed by any of their followings in Twitter. We conjecture that for these users Twitter is rather a source of information than a social networking site. Further validation is out of the scope of this paper and we leave it for future work.”
     • 3.4 Degree of Separation
     WWW'10, 591 (2010)
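The reciprocity figure quoted above (share of connected user pairs that are linked in both directions) can be computed from a follower edge list roughly like this. The input format and function name are assumptions for illustration; the paper's own pipeline is not shown on the slide.

```python
def reciprocity(edges):
    """Fraction of connected user pairs that follow each other.

    edges: iterable of (follower, followed) pairs (assumed format).
    Self-loops are ignored.
    """
    edge_set = {(a, b) for a, b in edges if a != b}
    # Every unordered pair with at least one link between them.
    pairs = {frozenset(edge) for edge in edge_set}
    # Pairs where the reverse link is also present.
    mutual = {frozenset((a, b)) for a, b in edge_set if (b, a) in edge_set}
    return len(mutual) / len(pairs) if pairs else 0.0


toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
toy_reciprocity = reciprocity(toy_edges)
```

In the toy graph one of three connected pairs is reciprocal; on the dataset in the excerpt the same ratio is reported as 22.1%.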
  18. 18. Retweet Trees
     Excerpt from WWW'10, 591 (2010) (conference header on the page: April 26-30 • Raleigh • NC • USA); the page is cropped on the slide, and [...] marks text that is cut off:
     • 6.1 Audience Size of Retweet. “[...] mass media in various forms: radio, TV, and [...] are immediate recipients and consumers of the [...] media produce. On Twitter people acquire [...] always directly from those they follow, but often [...] [as]suming a tweet posted by a user is viewed and [...] of the user’s followers, we count the number of [...] who are not immediate followers of the orig[inal ...]. Figure 14 displays its average and median per number of followers of the original tweet user. [...] almost always below the average, indicating that [...] a very large number of additional recipients. Up [...] [fo]llowers, the average number of additional recipi[ents ...] by the number of followers of the tweet source.”
     • Figure 14: [Aver]age and median numbers of additional recipi[ents] via retweeting
     WWW'10, 591 (2010)
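One way to read the counting step in this excerpt is sketched below: collect everyone who follows a retweeter, then drop the original author and the author's own followers. The follower map and the function name are hypothetical, and the paper may handle edge cases differently.

```python
def additional_recipients(author, retweeters, followers):
    """Count users reached through retweets who do not follow the author.

    followers: dict mapping a user to the set of users following them
    (assumed format).
    """
    direct = set(followers.get(author, ()))
    reached = set()
    for user in retweeters:
        reached.update(followers.get(user, ()))
    # Exclude the author and the author's immediate followers.
    return len(reached - direct - {author})


toy_followers = {
    "a": {"b", "c"},
    "b": {"c", "d", "e"},
    "d": {"a", "f"},
}
toy_extra = additional_recipients("a", ["b", "d"], toy_followers)
```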
  19. 19. Retweets Trees
     Excerpts from WWW'10, 591 (2010); the page is cropped on the slide, and [...] marks text that is cut off:
     • Figure 15: Retweet trees of ‘air france flight’ tweets
     • Figure 16: Height and participating users in retweet trees
     • “[...] [r]etweeting the same tweet, and cross-retweet is retweeting each [o]ther. In Figure 16 we plot the CCDFs of the retweet tree heights and [t]he number of users in a retweet tree. The height of 1 is the most [...]”
     • 6. IMPACT OF RETWEET. “We have seen how trending topics rise in popularity and ev[entu]ally die in Section 5. Then how exactly does information spre[ad on] Twitter? Retweet is an effective means to relay the informatio[n be]yond adjacent neighbors. We dig into the retweet trees constr[ucted] per trending topic and examine key factors that impact the eve[ntual] spread of information.”
     • 6.1 Audience Size of Retweet
     WWW'10, 591 (2010)
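The two quantities plotted in Figure 16 (tree height and number of participating users) can be computed with a breadth-first walk, sketched here. The `(retweeter, source)` edge format and the convention that a lone tweet has height 0 and a tweet with only direct retweets has height 1 are assumptions of this sketch.

```python
from collections import defaultdict, deque


def retweet_tree_stats(root, retweets):
    """Return (height, number of users) of the retweet tree rooted at `root`.

    retweets: iterable of (retweeter, source) pairs (assumed format).
    Height is counted in hops from the original author.
    """
    children = defaultdict(list)
    for retweeter, source in retweets:
        children[source].append(retweeter)

    height = 0
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        user, depth = queue.popleft()
        height = max(height, depth)
        for child in children.get(user, ()):
            if child not in seen:  # guard against cross-retweets
                seen.add(child)
                queue.append((child, depth + 1))
    return height, len(seen)


toy_stats = retweet_tree_stats("a", [("b", "a"), ("c", "a"), ("d", "b")])
```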
  20. 20. Link Function ICWSM’11, 89 (2011)
  21. 21. Link Function Agreement Discussion ICWSM’11, 89 (2011)
  22. 22. Social Interactions
  23. 23. Friends Talk to Each Other PLoS One 6, e22656 (2011)
  24. 24. Friends Talk to Each Other PLoS One 6, e22656 (2011)
  25. 25. Online Conversations
     • 1.7 Million users, 370 Million messages
     • Panel A) Average weight per connection: $\omega^{out}$ plotted against $k^{out}$, with $\omega_i^{out} = \frac{\sum_j \omega_{ij}}{k_i^{out}}$
     • Panel B) Reciprocated connections: $\rho$ plotted against $k^{in}$
     • Saturation of the number of reciprocated connections
     • Number of connections for which interaction strength is highest
     PLoS One 6, e22656 (2011)
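The average weight per connection from panel A, i.e. a user's total outgoing weight divided by the user's number of outgoing connections, can be sketched as below. The dictionary input format and the function name are assumptions for illustration.

```python
from collections import defaultdict


def average_out_weight(weights):
    """Average weight per outgoing connection for every sender.

    weights: dict mapping (i, j) to the number of messages i sent to j
    (assumed format). Pairs with zero weight are not counted as connections.
    """
    total = defaultdict(float)
    k_out = defaultdict(int)
    for (i, _j), w in weights.items():
        if w > 0:
            total[i] += w
            k_out[i] += 1
    return {i: total[i] / k_out[i] for i in total}


toy_weights = {("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 2}
toy_average = average_out_weight(toy_weights)
```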
  26. 26. The Strength of Ties
     Page fragments shown on the slide ([...] marks text that is cut off):
     • PHYSICAL REVIEW E 76, 066106 (2007): “[...t]wo possible cases in networks with [...]: (a) positively correlated nets and (b) [...] width of the line of the links represents [...]”
     • Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree. [Node labels A, C, B; node statistics in slide order: (k_in = 1, k_out = 2, s_in = 1, s_out = 2), (k_in = 2, k_out = 1, s_in = 3, s_out = 1), (k_in = 1, k_out = 1, s_in = 1, s_out = 2).]
     • “Observing a retweet at node B provides implicit confirmation that information from A appeared in B’s Twitter feed, while a mention of B originating at node A explicitly confirms that A’s message appeared in B’s Twitter feed. This may or may not be noticed by B, therefore mention edges are less reliable indicators of information flow compared to retweet edges. Retweet and reply/mention information parsed from the text can be ambiguous, as in the case when a tweet is marked as being a ‘retweet’ of multiple people. Rather, we rely on Twitter metadata, which designates users replied to or retweeted by each message. Thus, while the text of a tweet may contain several mentions, we only draw an edge to the user explicitly designated as the mentioned user by the metadata. In so doing, we may miss retweets that do not use the explicit retweet feature and thus are not captured in the metadata. Note that this is separate from our use of mentions as memes (§ 3.1), which we parse from the text of the tweet.”
  27. 27. The Strength of Weak Ties (1973)
     • Interviews to find out how individuals found out about job opportunities.
     • Mostly from acquaintances or friends of friends
     • “It is argued that the degree of overlap of two individuals' social networks varies directly with the strength of their tie to one another”
  28. 28. The Strength of Weak Ties (1973)
     • Interviews to find out how individuals found out about job opportunities.
     • Mostly from acquaintances or friends of friends
     • “It is argued that the degree of overlap of two individuals' social networks varies directly with the strength of their tie to one another”
     Additional page fragment shown on the slide ([...] marks text that is cut off):
     • “[...] indicates a particular egocentric network evolution. In order to quantify it, we measure the probability, p(n), that the next communication event of an agent having n social ties will occur via the establishment of a new (n + 1)th link. We calculate these probabilities in the MPC dataset averaging them for users with the same degree k at the end of the observation time. We therefore [...]”
     • Figure caption: “[...] Panels (a), and (b) show calls within 3 hours between people in the same town in two different time windows. [...] network structure, which was recorded by aggregating interactions during 6 months. Node size and colors [...] width and color represent weight.”
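The probability p(n) described in the fragment above can be estimated from a time-ordered event log along these lines. This sketch pools events over all agents, which is a simplification: the excerpt averages over users with the same final degree k. The event format and function name are assumptions.

```python
from collections import defaultdict


def new_tie_probability(events):
    """Estimate p(n): chance that an agent with n ties contacts someone new.

    events: time-ordered iterable of (agent, contact) pairs (assumed format).
    Returns a dict mapping n to the pooled estimate of p(n).
    """
    known = defaultdict(set)     # contacts each agent has already used
    new_counts = defaultdict(int)
    totals = defaultdict(int)
    for agent, contact in events:
        n = len(known[agent])
        totals[n] += 1
        if contact not in known[agent]:
            new_counts[n] += 1
            known[agent].add(contact)
    return {n: new_counts[n] / totals[n] for n in totals}


toy_events = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "b"), ("a", "c")]
toy_p = new_tie_probability(toy_events)
```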
  29. 29. Network Structure
     The Strength of Intermediary Ties in Social Media, PLoS One 7, e29358 (2012)
     “People whose networks bridge the structural holes between groups have an advantage in detecting and developing rewarding opportunities. Information arbitrage is their advantage. They are able to see early, see more broadly, and translate information across groups.”
     Paper page shown on the slide: Ronald S. Burt (University of Chicago), “Structural Holes and Good Ideas,” AJS Volume 110 Number 2 (September 2004): 349–99. Line ends are cut off on the slide; [...] marks words that could not be recovered.
     • Abstract: “This article outlines the mechanism by which brokerage prov[ides] social capital. Opinion and behavior are more homogeneous w[ithin] than between groups, so people connected across groups are m[ore] familiar with alternative ways of thinking and behaving. Broke[rage] across the structural holes between groups provides a vision o[f op]tions otherwise unseen, which is the mechanism by which broke[rage] becomes social capital. I review evidence consistent with the [hy]pothesis, then look at the networks around managers in a [...] American electronics company. The organization is rife with s[truc]tural holes, and brokerage has its expected correlates. Compensa[tion,] positive performance evaluations, promotions, and good idea[s ...] disproportionately in the hands of people whose networks [...] structural holes. The between-group brokers are more likely t[o ex]press ideas, less likely to have ideas dismissed, and more like[ly to] have ideas evaluated as valuable. I close with implications for [cre]ativity and structural change.”
     • Opening paragraph: “The hypothesis in this article is that people who stand near the hol[es in] a social structure are at higher risk of having good ideas. The argum[ent] is that opinion and behavior are more homogeneous within than betw[een] groups, so people connected across groups are more familiar with a[...]”
  30. 30. Network Structure
     The Strength of Intermediary Ties in Social Media, PLoS One 7, e29358 (2012)
     Page fragments shown on the slide ([...] marks text that is cut off):
     • “[...] to Granovetter expectation that the stronger the tie is the higher [the] number of mutual contacts of both parties it has and the higher the [...] belong to the same group.”
     • “[...] groups to consider is the characteristics of links [...]. [Th]ese links occur mainly between groups [...] 200 users (Figure 4A–C). However, their [...] the quality of the links (if they bear mentions [...]). [... lin]ks with mentions are less abundant than the [...] retweets are slightly more abundant. [... stre]ngth of weak ties theory [12,14–16], weak [...]”
     • “[...] between which they take place should be small according to the Granovetter’s theory. The results show that the most likely to attract retweets are the links connecting groups that are neither too close nor too far. This can be explained with Aral’s theory about the trade-off between diversity and bandwidth: if the two groups are too close there is no enough diversity in the information, while if the groups are too far the communication is poor. These trends are not dependant on the size of the considered groups (see Figs. S6, S7, S8, S9, S10, S11, S12, S13, S14 and Table S1 in the Supplementary Information).”
     • Figure 2. Group and link statistics. (A) Size distribution of the group. (B) Distribution of the number of groups to which each user is assigned. (C) Percentage of links of different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), staying in particular topological localizations in respect to detected groups. doi:10.1371/journal.pone.0029358.g002
  31. 31. Groups
     The Strength of Intermediary Ties in Social Media, PLoS One 7, e29358 (2012)
     Page fragments shown on the slide ([...] marks text that is cut off):
     • Figure 4. Group-group activity. (A) Distribution of the number of links in the follower network between groups as a function of the [...] groups. (B) Fractions f of links of the different types (follower, with mentions and with retweets) as a function of the size of the group [...]
     • 2.4 Links between groups. “The next question to consider is the characteristic[s of links] between groups. These links occur mainly betwee[n groups] containing less than 200 users (Figure 4A–C). Howe[ver, their] frequency depends on the quality of the links (if they bear [mentions] or retweets). While links with mentions are less abundan[t than the] baseline, those with retweets are slightly more [abundant]. According to the strength of weak ties theory [12,14–16], [weak] links are typically connections between persons no[...] neighbors, being important to keep the network conn[ected and] for information diffusion. We investigate whether [...] between groups play a similar role in the online n[etwork ...] information transmitters. The actions more related to in[formation] diffusion are retweets [24] that show a slight prefe[rence ...] occurring on between-group links (Figures 4B and [...]). [...] preference is enhanced when the similarity between [...] groups is taken into account. We define the similarity be[tween ...] groups, A and B, in terms of the Jaccard index [...] connections:
       $\mathrm{similarity}(A,B) = \dfrac{|\text{links of } A \,\cap\, \text{links of } B|}{|\text{links of } A \,\cup\, \text{links of } B|}$
       The similarity is the overlap between the groups’ connec[tions ...] it estimates network proximity of the groups. The gener[al ...] is that links with mentions more likely occur between clo[se ...] and retweets occur between groups with medium [...] (Figure 4D). Mentions as personal messages are [...] exchanged between users with similar environments [...] predicted by the strength of weak ties theory. Links with [...] are related to information transfer and the similarity of t[...]”
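The Jaccard index used above to compare two groups can be computed directly on sets, as in this sketch. Treating each group's connections as a set of link identifiers is an assumption about the data layout; the function name is invented for illustration.

```python
def group_similarity(links_a, links_b):
    """Jaccard index of two groups' connections: |A ∩ B| / |A ∪ B|.

    links_a, links_b: iterables of hashable link identifiers (assumed format).
    Returns 0.0 when both groups have no connections.
    """
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


toy_similarity = group_similarity({1, 2, 3}, {2, 3, 4})
```

A value near 1 means the two groups share almost all connections (close groups); a value near 0 means they are far apart in the network.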
  32. 32. Geolocation
  33. 33. Twitter follower distance
     Figure caption, partially cut off on the slide: “[...] physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, New [...] towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on [...]”
     Y. Takhteyev et al., Social Networks 34, 73 (2012)
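Counting ties by distance in 200 km bins, as in the figure caption above, might look like the sketch below. The haversine formula, the mean Earth radius of 6371 km, labelling each bin by its lower edge, and the coordinate format are all assumptions of this sketch; the paper's exact distance computation is not shown on the slide. New York and London are used here only as a test case.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def ties_by_distance(ties, bin_km=200):
    """Count ties per distance bin; each bin is labelled by its lower edge.

    ties: iterable of ((ego_lat, ego_lon), (alter_lat, alter_lon)).
    """
    counts = Counter()
    for (lat1, lon1), (lat2, lon2) in ties:
        distance = haversine_km(lat1, lon1, lat2, lon2)
        counts[int(distance // bin_km) * bin_km] += 1
    return dict(counts)


new_york = (40.7128, -74.0060)
london = (51.5074, -0.1278)
toy_bins = ties_by_distance([(new_york, london), (new_york, new_york)])
```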
  34. 34. Locality
     Pages 77 and 79 of Y. Takhteyev et al., Social Networks 34 (2012) 73–81, as shown on the slide ([...] marks text that is cut off).

     Table 5. Top countries.
     Country | Share of egos (%) (a) | Share of egos (%) for egos in dyads (b) | Share of alters (%) (c) | Percentage of domestic ties (d) | Percentage of domestic ties among non-local ties (d) | Following foreign alters/being followed from abroad | Country named explicitly (% of egos)
     USA | 48.5 | 45.7 | 54.5 | 91.6 | 89.3 | 0.3 | 8.1
     Brazil | 10.6 | 12.1 | 10.5 | 83.5 | 72.5 | 4.9 | 55.4
     UK | 7.6 | 8.3 | 7.6 | 50.6 | 33.3 | 1.2 | 45.3
     Japan | 5.5 | 6.5 | 6.3 | 92.1 | 86.0 | 1.4 | 25.0
     Canada | 3.7 | 3.8 | 2.9 | 33.3 | 23.1 | 1.6 | 58.5
     Australia | 2.7 | 2.7 | 1.9 | 50.0 | 32.0 | 2.2 | 69.7
     Indonesia | 2.6 | 1.8 | 1.2 | 60.0 | 25.0 | 7.0 | 83.3
     Germany | 2.1 | 1.8 | 1.3 | 62.9 | 58.8 | 3.2 | 58.6
     Netherlands | 1.4 | 1.4 | 1.2 | 66.7 | 22.2 | 1.5 | 54.3
     Mexico | 1.2 | 1.3 | 0.7 | 44.0 | 8.3 | 7.0 | 56.7
     (a) Out of the 2852 egos located at the level of country or better.
     (b) Out of the egos included in 1953 dyads with both parties located at the level of country or better.
     (c) Out of the 1953 alters located at the level of country or better.
     (d) The number of ties with the ego and the alter in the given country as a share of all ties for egos in that country.

     “[...] between those two interpretations. We also note that top Twitter clusters intersect only to an extent with Alderson and Beckfield’s (2004) ranking of world cities based on multinational corporations’ branch headquarters. (Of Alderson and Beckfield’s top 25 cities by in-degree or “prestige,” 13 appear in the top 25 Twitter clusters ranked by in-degree centrality, with another 6 appearing in top 100.)”

     5.3. National borders. “Of the ties that were matched to countries, 75 percent connect users in the same country. This prevalence of domestic ties is [...]”

     Table 6. The most common languages. Based on 2852 egos.
     Language | % of egos
     English | 72.5
     Portuguese | 10.1
     Japanese | 5.4
     Spanish | 3.1
     Indonesian | 1.8
     German | 1.7
     Dutch | 1.0
     Chinese | 0.9
     Korean | 0.4
     Swedish | 0.4

     “[...] accounts, by randomly drawing an account from among those “followed” by each of those egos. We then coded the locations of the alters using the same procedure as we did for the egos, removing those pairs where the alter could not be assigned to a country. In the end, we obtained a sample of 1953 ego-alter pairs with both the ego and the alter assigned to a country, including 1259 pairs with “specific” locations for both parties (Table 1).”

     4.4. Aggregating nearby locations. “Since specific locations vary substantially in precision and since users can often choose between a range of specific names for the same place (e.g., “Palo Alto” vs. “Silicon Valley” vs. “SF Bay”), we aggregated nearby locations within each country, by assigning a set of coordinates (obtained from Google Maps) to each location smaller than 25,000 km² and then merging nearby locations within each country by replacing their coordinates with a weighted average of the coordinates of the merged locations. This reduced our location descriptions to a set of 386 regional clusters, which are comparable in size to metropolitan areas. We labeled each cluster with the most common name associated with it in our sample. For example, the cluster centered on Manhattan is referred to as “New York.””

     5. Analysis. “In this section we analyze the factors affecting the formation of Twitter ties. We first look at the effect of each variable identified earlier based on theoretical considerations: the actual physical distance, the frequency of air travel, national boundaries, and language differences. In addition to presenting the descriptive statistics demonstrating the effects of each variable and investigating the nature of such effects, we correlated the effects using the Quadratic Assignment Procedure (QAP, Krackhardt, 1987; Butts, 2007). In the last subsection we also examined the relationship between the variables using QAP regression (Double Dekker Semi-partialling MRQAP). All statistical calculations were done using UCINet 6.277 (Borgatti et al., 2002). For correlation and regression analysis we used networks with nodes representing the 25 largest regional clusters of users (see [...]”

     Table 3. Top clusters.
     Rank | Cluster (a) | Share of egos (%) (b) | Share of egos (%) for egos in dyads (c) | Share of alters (%) (d) | Locality (e)
     1 | “New York” | 8.5 | 8.3 | 10.2 | 54.3
     2 | “Los Angeles, CA” | 5.1 | 5.6 | 10.4 | 53.3
     3 | [Japanese name not preserved] (Tokyo) | 4.1 | 4.8 | 5.0 | 62.9
     4 | “London” | 3.6 | 3.3 | 4.9 | 48.8
     5 | “São Paulo” | 3.5 | 3.0 | 3.6 | 78.4
     6 | “San Francisco” | 2.8 | 2.7 | 4.1 | 41.2
     7 | “New Jersey” (f) | 2.5 | 2.8 | 2.1 | 20.0
     8 | “Chicago” | 2.2 | 2.0 | 1.7 | 32.0
     9 | “Washington, DC” | 2.1 | 2.8 | 2.6 | 34.3
     10 | “Manchester, UK” | 1.9 | 2.0 | 1.1 | 30.8
     11 | “Atlanta” | 1.7 | 2.1 | 2.1 | 46.2
     12 | “San Diego” | 1.5 | 1.5 | 1.1 | 26.3
     13 | “Toronto, Canada” | 1.3 | 1.1 | 1.5 | 42.9
     14 | “Seattle” | 1.3 | 1.4 | 1.2 | 58.8
     15 | “Houston” | 1.2 | 1.2 | 1.0 | 40.0
     16 | “Dallas, Texas” | 1.2 | 1.0 | 1.4 | 61.5
     17 | “Rio de Janeiro” | 1.2 | 1.0 | 1.1 | 30.8
     18 | “Boston, MA” | 1.2 | 1.2 | 1.1 | 20.0
     19 | “Amsterdam” | 1.1 | 1.1 | 0.9 | 50.0
     20 | “Jakarta, Indonesia” | 1.1 | 0.6 | 0.3 | 42.9
     21 | “Austin, TX” | 1.0 | 1.0 | 1.3 | 50.0
     22 | “Sydney” | 0.9 | 1.0 | 0.8 | 38.5
     23 | “Orlando, Forida” | 0.9 | 1.0 | 0.6 | 16.7
     24 | “Phoenix, AZ” | 0.8 | 0.7 | 0.6 | 11.1
     25 | [Japanese name not preserved] (Hyōgo) (g) | 0.8 | 1.0 | 1.0 | 25.0
     (a) Each cluster is labeled with the name most frequently used for locations assigned to the cluster.
     (b) Out of the 2167 egos located with precision of <25,000 km².
     (c) Out of the 1259 egos included in dyads with both parties located with precision of <25,000 km².
     (d) Out of the 1259 alters included in dyads with both parties located with precision of <25,000 km².
     (e) Defined as the share of local ties among all ties for egos in a cluster.
     (f) Centered between Philadelphia and Trenton, NJ and includes all locations identified as just “New Jersey”.
     (g) Centered near the boundary between Hyōgo and Osaka prefectures, in the Kansai region of Japan.

     “[...] over half of the egos are in other countries, as are 4 of the 10 largest clusters: Tokyo, São Paulo, and two clusters in the United [...]”
     Social Networks 34, 73 (2012)
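The "percentage of domestic ties" column of Table 5 is defined in its footnote as the number of ties with ego and alter in the same country, as a share of all ties for egos in that country. A minimal sketch of that calculation, with an assumed `(ego_country, alter_country)` input format and an invented function name:

```python
from collections import Counter


def domestic_tie_share(dyads):
    """Percentage of each country's ego ties whose alter is in the same country.

    dyads: iterable of (ego_country, alter_country) pairs (assumed format).
    """
    total = Counter()
    domestic = Counter()
    for ego_country, alter_country in dyads:
        total[ego_country] += 1
        if ego_country == alter_country:
            domestic[ego_country] += 1
    return {c: 100.0 * domestic[c] / total[c] for c in total}


toy_dyads = [("USA", "USA"), ("USA", "UK"), ("UK", "UK"), ("USA", "USA")]
toy_share = domestic_tie_share(toy_dyads)
```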
  35. 35. Mobility and Social Networks
     Coupling Mobility and Interactions in Social Media, PLoS One 9, e92196 (2014)
     Geography and Social Networks [diagram; recoverable labels: Geography, Follower, Reply, ReTweet]
     Page fragments shown on the slide ([...] marks text that is cut off):
     • “[...] and for their dependence on the distance. The error Err of this null model is between 0.66–0.76 for the three countries, around twice the error of the TF model (see Figure 6). The linking model (L model) is a simplified version of the TF model, without random mobility and the box size d → 0. Agents move to visit their contacts with probability p_v, whereas with probability 1 − p_v they do not perform any action. In this version of the model, users can connect only by random connections or when two of them coincide, visiting a common friend, which leads to triadic closure. These two processes do not depend on the distances between the users. A thorough description can be obtained with a mean-field approach (see the corresponding section). The results of the L model are shown in Figure 2. Due to the triangle closing mechanism, this null model creates networks with a considerable level of clustering. However, it does not [...]”
     • “[...] (e.g., for the US the TF model has Err lower by 0.5 and 1.5 than the TF-normal and the TF-uniform models, respectively, as shown in Figure 6). Simplified models that neglect either geography or network structure perform considerably worse than the TF model in reproducing the properties of real networks. Likewise, non-realistic assumptions on human mobility mechanism yield worse results than the default TF model. To conclude, the coupling of geography and structure through a realistic mobility mechanism produces networks with significantly more realistic geographic and structural properties.”
     • Sensitivity of the TF Model to the Parameters and its Modifications. “The results presented so far have been obtained at the optimal [...]”
     • Figure 4. Simulation results: mobility and social networks. Mobility (upper row) and ego networks (lower row) of 20 random users (different colors) for the instances of the TF model yielding the lowest error Err (see Figure 3). Mobility network shows mobility patterns of individual users throughout entire simulation. Ego network shows the social connections at the end of the simulation. doi:10.1371/journal.pone.0092196.g004
  36. 36. www.bgoncalves.com@bgoncalves Geo-Social Properties PLoS One 9, E92196 (2014). Figure panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity. Excerpt from "Coupling Mobility and Interactions in Social Media": [...] that has also an edge between i and k, forming a triangle. Note a triangle consists of 3 triads centered on different nodes. The effect of the distance on the clustering coefficient can be incorporated by measuring the distances from each central node j to two neighbors i and k forming a triad, $d = d_{ij} + d_{jk}$, and calculating the network clustering restricted to triads with distance d. This new function C(d) is the probability of closing a triangle given the distance d in a triad, $C(d) = \Delta(d) / \Lambda(d)$, where $\Lambda(d)$ and $\Delta(d)$ are the numbers of triads and closed triads for the distance d, respectively. The value of the global clustering coefficient C can be recovered by averaging C(d) over d. In the datasets, we observe a drop in C(d) followed by a plateau, which is best visible for the US networks (Figure 2E). Given a triangle, several configurations are possible if there is diversity in the edge lengths. The triangle can be equilateral if the edges have the same length, isosceles if two have the same length and the other is smaller, etc. We estimate the dominant shapes of the triangles in the network by measuring the disparity, defined as $D = 6 \left[ \frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3} \right]$, where $d_1$, $d_2$ and $d_3$ are the geographical distances between the locations of the users forming the triangle. The disparity takes values between 0 and 1 as the shape of the triangle passes from equilateral to isosceles, where one edge is much smaller than the other two. D shows a distribution with two maxima in the online social networks (Figure 2F), for low and high values. The two [...]
Figure 2. Network geo-social properties. Various statistical network properties are plotted for the data obtained from Twitter (red squares), Gowalla (blue diamonds), Brightkite (green triangles) and the null models (dashed lines), for the US (for the UK and Germany, see Figures S1 and S2). The spatial model (magenta), based on geography, matches well the data in Pl(d), but yields near-zero values for R(d), Jf(d) and C(d). The linking model (cyan), based on triadic closure, produces enough clustering, but it does not reproduce the distance dependencies of Pl(d), R(d), Jf(d) and C(d). doi:10.1371/journal.pone.0092196.g002
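The two geo-social quantities defined on this slide, the triangle disparity D and the distance-dependent clustering C(d), can be illustrated with a short Python sketch. This is only a sketch under stated assumptions: the function names, the `(distance, is_closed)` input format and the fixed-width distance bins are hypothetical choices for illustration, not the procedure used in the paper.

```python
from collections import defaultdict


def triangle_disparity(d1, d2, d3):
    """D = 6 * ((d1^2 + d2^2 + d3^2) / (d1 + d2 + d3)^2 - 1/3)."""
    total = d1 + d2 + d3
    if total <= 0:
        raise ValueError("edge lengths must have a positive sum")
    return 6.0 * ((d1 * d1 + d2 * d2 + d3 * d3) / (total * total) - 1.0 / 3.0)


def clustering_by_distance(triads, bin_width):
    """Fraction of closed triads per distance bin.

    `triads` is an iterable of (d, is_closed) pairs, where d = d_ij + d_jk.
    Fixed-width binning is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [closed, total]
    for distance, is_closed in triads:
        bin_index = int(distance // bin_width)
        counts[bin_index][1] += 1
        if is_closed:
            counts[bin_index][0] += 1
    return {b: closed / total for b, (closed, total) in sorted(counts.items())}
```

An equilateral triangle gives D = 0, and a triangle with one vanishing edge (d1 = d2, d3 = 0) gives D = 1, matching the range quoted in the excerpt.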
  37. 37. www.bgoncalves.com@bgoncalves Geo-Social Model. Diagram steps: Starting position of user u; Visit a random neighbour or Jump to a new location; New position of u; Detect all encounters e in the box of u; Created new social links. PLoS One 9, E92196 (2014)
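The update loop in the diagram can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the Gaussian jump, the undirected friendship sets, the square grid of boxes and all names (`tf_step`, `box_of`, `p_visit`, `p_connect`, `jump_scale`) are assumptions introduced here.

```python
import math
import random


def box_of(position, box_size):
    """Index of the square box (grid cell) that contains a position."""
    x, y = position
    return (math.floor(x / box_size), math.floor(y / box_size))


def tf_step(positions, friends, p_visit, p_connect, box_size, jump_scale, rng):
    """Move every user once, in place, and return the links created."""
    new_links = []
    users = sorted(positions)
    for u in users:
        if friends[u] and rng.random() < p_visit:
            # Visit a random neighbour: move to that friend's position.
            positions[u] = positions[rng.choice(sorted(friends[u]))]
        else:
            # Jump to a new location (Gaussian displacement is an assumption).
            x, y = positions[u]
            positions[u] = (x + rng.gauss(0.0, jump_scale),
                            y + rng.gauss(0.0, jump_scale))
        # Detect all encounters e in the box of u and possibly befriend them.
        box_u = box_of(positions[u], box_size)
        for e in users:
            if e == u or e in friends[u]:
                continue
            if box_of(positions[e], box_size) == box_u and rng.random() < p_connect:
                friends[u].add(e)
                friends[e].add(u)
                new_links.append((u, e))
    return new_links
```

Calling `tf_step` repeatedly with `rng = random.Random(seed)` gives a reproducible toy simulation; `p_visit` and `p_connect` play the role of the two probabilities scanned on the Model Fitting slide.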
  38. 38. www.bgoncalves.com@bgoncalves Model Fitting PLoS One 9, E92196 (2014). Figure axes: Prob. to Make a New Friend and Prob. to Visit an Old Friend. Excerpts from "Coupling Mobility and Interactions in Social Media": [...] 0.39 for Germany. For simplicity, we focus on the Twitter networks only, although similar results are obtained for the other datasets. Results. Simulations for the Optimal Parameters: An example with the displacements between the consecutive locations and the ego networks for a sample of individuals, as generated by the TF model, is displayed in Figure 4. The parameters of the model are set to the ones that correspond to the minimum of the error Err. As shown, the agents tend to stay close to their original positions. Occasional long jumps occur due to visits to friends who live far apart. In this range of parameters and simulation times, the main mechanism for generating long distance [...] second null model, the linking model (L model), in contrast, is based only on random linking and triadic closure, and it is equivalent to the TF model without the mobility. We consider the two uncoupled null models and compare their results with those of the TF model. In this way, we demonstrate the importance of the coupling through a realistic mobility mechanism to reproduce the empirical networks. The spatial model (S model) consists of randomly connecting pairs of users with a probability that decays as a power-law of the distance between them (suggested in [41]). The exponent of the power-law is fixed at −0.7 following Figure 2A. The results of the S model are shown in the panels of Figure 2. While it is set to match Pl(d), other properties such as P(k), R(d), Jf(d), C(d) or P(D) are not well reproduced. The S model fails to account for the high level of clustering and reciprocity in the empirical networks [...]
Figure 3. Fitting the TF model. Values of the error Err when pv and pc are changed. The minimum error for each of the plots is marked with a red rectangle. doi:10.1371/journal.pone.0092196.g003
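The spatial null model (S model) described in the excerpt, where pairs of users are connected with a probability that decays as a power law of their distance, can be sketched as below. The prefactor `scale`, the cap at probability 1 and the function names are assumptions for illustration; the excerpt only fixes the exponent at −0.7.

```python
import math
import random


def link_probability(distance, exponent=0.7, scale=0.5):
    """Power-law decay p(d) = scale * d**(-exponent), capped at 1.

    `scale` is a hypothetical normalisation; the excerpt does not give one.
    """
    if distance <= 0:
        return 1.0  # assumption: co-located users are always linked
    return min(1.0, scale * distance ** (-exponent))


def spatial_model_links(positions, rng, exponent=0.7, scale=0.5):
    """Connect every pair of users independently with probability p(d)."""
    users = sorted(positions)
    links = []
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            d = math.dist(positions[u], positions[v])
            if rng.random() < link_probability(d, exponent, scale):
                links.append((u, v))
    return links
```

Because each pair is drawn independently, a model of this kind has no mechanism for closing triangles, which is consistent with the excerpt's remark that the S model fails to reproduce clustering.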
  39. 39. www.bgoncalves.com@bgoncalves Model Results PLoS One 9, E92196 (2014). Figure panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity. Figure 5. Geo-social properties of the model networks. Various statistical properties are plotted for the networks obtained from Twitter data (red squares) and from simulation of the TF model (black line) for the US. Corresponding results for the UK and Germany can be found in Figures S3 and S4. doi:10.1371/journal.pone.0092196.g005. Excerpt from "Coupling Mobility and Interactions in Social Media": [...] random connections, and so the distribution of triangles disparity prevents the model from producing networks with characteristics [...]
  40. 40. www.bgoncalves.com@bgoncalves Human Diffusion. Figure panels: (a) Starting from Paris; (b) Starting from New York. J. R. Soc. Interface 12, 20150473 (2015)
  41. 41. www.bgoncalves.com@bgoncalves Human Diffusion. Figure panel: (b) Starting from New York. J. R. Soc. Interface 12, 20150473 (2015)
  42. 42. www.bgoncalves.com@bgoncalves Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015)
  43. 43. www.bgoncalves.com@bgoncalves Residents and Tourists. Figure panels: (a) Coverage and R~ for Local and Non-Local users; (b) Proportion of Non-Local Users and Coverage; (c) Coverage for New York, Chicago, San Francisco, Shanghai, Dallas, Berlin, Paris, Saint Petersburg, Beijing, Moscow; (d) Coverage for Houston, Barcelona, Brussels, Detroit, Lima, Istanbul, Rome, Moscow, Paris, Lisbon. J. R. Soc. Interface 12, 20150473 (2015)
  44. 44. www.bgoncalves.com@bgoncalves City Communities. Figure: Weighted Betweenness (×10², axis from 0 to 10) and Weighted degree for Los Angeles, San Francisco, Miami, Singapore, Tokyo, Paris, London, New York. J. R. Soc. Interface 12, 20150473 (2015)
  45. 45. Collective Attention
  46. 46. www.bgoncalves.com@bgoncalves #tags • Metadata added to a Tweet for topic marking • Originally proposed by Chris Messina in 2007 • Quickly adopted informally by the Twitter community • Native support added by Twitter after it became popular
  47. 47. www.bgoncalves.com@bgoncalves Hashtag Statistics. Figure: number of users and number of tweets against tag (axis ticks 10^1, 10^3, 10^5), with a marker at 500 users. Labeled tags: swsx, swineflu, gfail, peace, watchmen, nsotu, winnenden, masters. WWW’12, 251 (2012)
  48. 48. www.bgoncalves.com@bgoncalves Activity Peak Detection • Peak: activity has to be at least 10 times larger than the baseline • A minimal level of activity is expected • Selection of isolated popularity bursts (no other peaks one week before/after) • We detected 115 peaks. Figure: example activity patterns labeled continuous, periodic and peak (#video, #ff, #w2e). WWW’12, 251 (2012)
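The selection rules on this slide can be sketched as a small Python function over a series of daily tweet counts for one hashtag. The slide does not say how the baseline is computed or what the minimal activity level is, so the median baseline, the `min_count=50` threshold and the function name are hypothetical choices made only for illustration.

```python
from statistics import median


def detect_isolated_peaks(daily_counts, ratio=10.0, min_count=50, window=7):
    """Return indices of days that look like isolated popularity bursts.

    A day is a candidate if its count is at least `ratio` times the baseline
    and at least `min_count`. A candidate is kept only if no other candidate
    lies within `window` days before or after it.
    """
    # Assumption: the baseline is the median daily count, floored at 1
    # so that a silent hashtag does not give a zero threshold.
    baseline = max(median(daily_counts), 1.0)
    candidates = [
        day for day, count in enumerate(daily_counts)
        if count >= min_count and count >= ratio * baseline
    ]
    return [
        day for day in candidates
        if not any(other != day and abs(other - day) <= window
                   for other in candidates)
    ]
```

With this rule two bursts three days apart are both discarded, while two bursts fifteen days apart are both kept, which is one way to read "no other peaks one week before/after".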
  49. 49. www.bgoncalves.com@bgoncalves Peak Characterization. Figure: Tweets against Days (−30 to 30) with the baseline and the peak marked and the activity split into Before Peak, Peak and After Peak; example panels with percentage annotations for #winnenden (school shooting, Mar 3, 2009) and #watchmen (movie release date, Mar 6, 2009). WWW’12, 251 (2012)
  50. 50. www.bgoncalves.com@bgoncalves Some Examples. Figure: number of tweets and user ID against days after peak (−6 to 6) for #masters (master cup finale, Apr 9, 2009), #winnenden (school shooting, Mar 3, 2009), #watchmen (movie release date, Mar 6, 2009) and #nsotu (Obama's first state of the union, Feb 25, 2009). Panels carry percentage annotations (e.g. 34%, 27%, 39% next to #watchmen and 98%, 2% next to #nsotu). Panel labels: Anticipation, Reaction, “Instantaneous”, “Anticipation + Reaction”. WWW’12, 251 (2012)
  51. 51. www.bgoncalves.com@bgoncalves Classes of Peaks • Anticipatory behavior • Increasing amount of tweets until the event • Sharp drop of attention after the event • Unexpected events • Driven by exogenous sources • Activity concentrated on the peak day • Events that are only discussed while they are happening • Collective attention is built up to a peak intensity, then attention shifts away. Figure: triangle plot with corners 100% peak, 100% before and 100% after, and edges 0% peak (fp = 0), 0% before (fb = 0) and 0% after (fa = 0). Hashtags shown: swineflu, h1n1, sxswi, easter, teaparty, advertising, masters, nfl, earthhour, twestival, plurk, firstfollow, mrtweet, cebit, bsg, cricket, google, hadopi, inaug09, drupalcon, coalition, geek, w2e, humor, davos, watchmen, job, house, mikeyy, superbowl, gfail, blackout, oscar, snowmageddon, nsotu, zombies, rp09, brand, skittles, phish, ces09, socialmedia, winnenden, peace, macheist, earthday, amazonfail, fridayfollow, aprilfools. WWW’12, 251 (2012)
  52. 52. www.bgoncalves.com@bgoncalves Barycentric Coordinates. Figure: 2D-Simplex with labeled points (0,0,1), (0,1,0), (1,0,0), (0,1/2,1/2), (1/3,1/3,1/3), (1/2,0,1/2), (1/2,1/2,0), (1/2,1/4,1/4), (1/4,1/2,1/4), (1/4,1/4,1/2), shown next to the example panels for #masters, #winnenden, #watchmen and #nsotu. WWW’12, 251 (2012)
  53. 53. www.bgoncalves.com@bgoncalves Barycentric Coordinates. Figure: 2D-Simplex with corners 100% peak, 100% before and 100% after (edges 0% peak, 0% before, 0% after), shown next to the example panels for #masters, #winnenden, #watchmen and #nsotu. Hashtags shown: swineflu, h1n1, sxswi, easter, teaparty, advertising, masters, nfl, earthhour, twestival, plurk, firstfollow, mrtweet, cebit, bsg, cricket, google, hadopi, inaug09, drupalcon, coalition, geek, w2e, humor, davos, watchmen, job, house, mikeyy, superbowl, gfail, blackout, oscar, snowmageddon, grammys, zombies, rp09, brand, skittles, phish, ces09, socialmedia, winnenden, peace, macheist, earthday, amazonfail, fridayfollow, aprilfools. WWW’12, 251 (2012)
  54. 54. www.bgoncalves.com@bgoncalves Barycentric Coordinates. Figure: the same triangle plot and hashtags as on slide 53, with the example panels for #masters (master cup finale, Apr 9, 2009), #winnenden (school shooting, Mar 3, 2009), #watchmen (movie release date, Mar 6, 2009) and #nsotu (Obama's first state of the union, Feb 25, 2009), and enlarged panels for #watchmen and #nsotu. WWW’12, 251 (2012)
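The triangle plots place each hashtag according to the fraction of its tweets posted before, on and after the peak day. A minimal Python sketch of that mapping is given below. The placement of the triangle's corners in the plane, the function names and the use of a simple list of daily counts are assumptions for illustration; only the three fractions (fb, fp, fa) and their sum of one come from the slides.

```python
import math

# Assumed corner placement for an equilateral triangle in the plane.
VERTICES = {
    "before": (0.0, 0.0),
    "after": (1.0, 0.0),
    "peak": (0.5, math.sqrt(3) / 2),
}


def attention_fractions(daily_counts, peak_index):
    """Fractions (fb, fp, fa) of tweets before, on and after the peak day."""
    total = sum(daily_counts)
    if total == 0:
        raise ValueError("no activity in the window")
    before = sum(daily_counts[:peak_index])
    peak = daily_counts[peak_index]
    after = sum(daily_counts[peak_index + 1:])
    return before / total, peak / total, after / total


def to_simplex_xy(fb, fp, fa):
    """Barycentric coordinates (fb, fp, fa) -> point inside the triangle."""
    x = sum(f * VERTICES[name][0]
            for f, name in ((fb, "before"), (fp, "peak"), (fa, "after")))
    y = sum(f * VERTICES[name][1]
            for f, name in ((fb, "before"), (fp, "peak"), (fa, "after")))
    return x, y
```

A hashtag with all of its activity on the peak day lands on the "100% peak" corner, and one with equal thirds lands at the centre of the triangle, the point labeled (1/3,1/3,1/3) on slide 52.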
  55. 55. www.bgoncalves.com@bgoncalves Language Matters
  56. 56. Languages
  57. 57. www.bgoncalves.com@bgoncalves Signal By Language
  58. 58. www.bgoncalves.com@bgoncalves Signal By Language Italian English Spanish Portuguese Other 76%
  59. 59. www.bgoncalves.com@bgoncalves Signal By Language Italian English Spanish Portuguese Other 16%
  60. 60. www.bgoncalves.com@bgoncalves Signal By Language Italian English Spanish Portuguese Other 2%
  61. 61. www.bgoncalves.com@bgoncalves Signal By Language Italian English Spanish Portuguese Other
  62. 62. www.bgoncalves.com@bgoncalves Spanish PLoS One 9, E112074 (2014)
  63. 63. www.bgoncalves.com@bgoncalves Local Variations PLoS One 9, E112074 (2014)
  64. 64. www.bgoncalves.com@bgoncalves Superdialects. Figure panels A), B), C): map of cities (Mexico City, Guatemala, San Salvador, Caracas, San Jose, Panama, Bogota, Quito, Lima, Asuncion, Cordoba, Santiago, Buenos Aires, Santiago De Compostela, Palma De Mallorca, Santander, Oviedo, Bilbao, Zaragoza, Valladolid, Barcelona, Madrid, Seville, San Diego, Miami, New York, San Juan, Santo Domingo); f(K) and silhouette against the number of clusters; clusters α and β with Population (×10^5), annotated N = 179 and N = 956. PLoS One 9, E112074 (2014)
  65. 65. www.bgoncalves.com@bgoncalves Regional Dialects PLoS One 9, E112074 (2014)
  66. 66. www.bgoncalves.com@bgoncalves
  67. 67. www.bgoncalves.com@bgoncalves
  68. 68. www.bgoncalves.com@bgoncalves Bilingualism
  69. 69. www.bgoncalves.com@bgoncalves Global Language Network. Figure: language networks for Twitter and Wikipedia, with nodes labeled by language name. Legend: Language Family (Afro-Asiatic, Altaic, Amerindian, Austronesian, Caucasian, Creoles and pidgins, Dravidian, Indo-European, Niger-Congo, Other, Sino-Tibetan, Tai, Uralic); Population (1 million, 10 million, 100 million, 1 billion); Link Weight and Color (t-statistic, with 2.59 marked; co-occurrences of users, editors and translations, with min 6, 6, 6 and max 994,682, 49,637, 183,329, listed in the order twitter, wikipedia, book translations). PNAS 111, E5616 (2014)
  70. 70. www.bgoncalves.com@bgoncalves Global Language Network. Figure: Wikipedia and Twitter language networks, with nodes labeled by language name. Legend: Language Family, Population, Link Weight and Color (t-statistic). PNAS 111, E5616 (2014)
  71. 71. www.bgoncalves.com@bgoncalves Global Language Network. Figure: Book Translations language network, with nodes labeled by language name (modern languages as well as historical ones such as Old Norse, Middle English (1100-1500) and Ancient Greek (to 1453)). PNAS 111, E5616 (2014)
  72. 72. www.bgoncalves.com@bgoncalves Global Language Network [...] numbers are 41% and 63%. In contrast, the correlation between the representation of languages in Twitter and Book Translations is 0.63 (R² = 40%), and the correlation between the strength of links is only 0.48 (R² = 23%). Finally, we note that, with respect to the book translation dataset, the two digital datasets (Twitter and Wikipedia) are overexpressed in languages associated with developing countries, like Malay, Filipino and Swahili. This indicates that these digital media are more inclusive of the populations of developing countries than written books. PNAS 111, E5616 (2014)
  73. 73. www.bgoncalves.com@bgoncalves Language and Fame. Figure panels A to F, points labeled by language code (e.g. eng, spa, fra, deu, rus, zho, jpn, por): log10 (Wikipedia 26+ famous people) and log10 (HA famous people) against log10 (Twitter Eigenvector Centrality), log10 (Wikipedia Eigenvector Centrality) and log10 (Book Trans. Eigenvector Cent.); color scale GDP per Capita ($0k to $50k); size scale Number of speakers (400 M, 800 M, 1200 M). Annotations: A) R² = 0.447, B) R² = 0.755, C) R² = 0.693, D) R² = 0.399, E) R² = 0.758, F) R² = 0.858, each with p-value 0.001. Fig. 3. The position of a language in the GLN and the global impact of its speakers. Top row shows the number of people per language (born 1800–1950) with articles in at least 26 Wikipedia language editions as a function of their language’s eigenvector centrality in the (A) Twitter GLN, (B) Wikipedia GLN, and (C) book translation GLN. The bottom row shows the number of people per language (born 1800–1950) listed in Human Accomplishment as a function of their language’s eigenvector centrality in (D) Twitter GLN, (E) Wikipedia GLN, and (F) book translation GLN. Size represents the number of speakers for each [...] PNAS 111, E5616 (2014)
  74. 74. Predictions
  75. 75. www.bgoncalves.com@bgoncalves Collective Attention “Prediction is very difficult, especially about the future.” (Niels Bohr)
  76. 76. www.bgoncalves.com@bgoncalves Even more so in Political Elections http://truthy.indiana.edu [Figure: panels A-H; labels include #ampat, @PeaceKaren_25, gopleader.gov and the tweet “How Chris Coons budget works- uses tax $ 2 attend dinners and fashion shows”]
  77. 77. www.bgoncalves.com@bgoncalves Even more so in Political Elections http://truthy.indiana.edu [Figure: panels A-H; labels include #ampat, @PeaceKaren_25, gopleader.gov and the tweet “How Chris Coons budget works- uses tax $ 2 attend dinners and fashion shows”]

      Table 1: Features used in truthy classification.
        nodes              Number of nodes
        edges              Number of edges
        mean k             Mean degree
        mean s             Mean strength
        mean w             Mean edge weight in largest connected component
        max k(i,o)         Maximum (in,out)-degree
        max k(i,o) user    User with max. (in,out)-degree
        max s(i,o)         Maximum (in,out)-strength
        max s(i,o) user    User with max. (in,out)-strength
        std k(i,o)         Std. dev. of (in,out)-degree
        std s(i,o)         Std. dev. of (in,out)-strength
        skew k(i,o)        Skew of (in,out)-degree distribution
        skew s(i,o)        Skew of (in,out)-strength distribution
        mean cc            Mean size of connected components
        max cc             Size of largest connected component
        entry nodes        Number of unique injections
        num truthy         Number of times ‘truthy’ button was clicked
        sentiment scores   Six GPOMS sentiment dimensions

      [...] graph. These include the number of nodes and edges in the graph, the mean degree and strength of nodes in the graph, mean edge weight, mean clustering coefficient across nodes in the largest connected component, and the standard deviation and skew of each network’s in-degree, out-degree and strength distributions (see Fig. 2). Additionally we track the out-degree and out-strength of the most prolific broadcaster, as well as the in-degree and in-strength of the most focused-upon user. We also monitor the number of unique injection points of the meme, reasoning that organic memes (such as those relating to news events) will be associated with larger number of originating users.

      4.4 Sentiment Analysis
      We also utilize a modified version of the Google-based Profile of Mood States (GPOMS) sentiment analysis method (Bollen, Mao, and Pepe 2010) in the analysis of meme-specific sentiment on Twitter. The GPOMS tool as[...]

      Table 2: Performance of two classifiers with and without resampling training data to equalize class sizes. All results are averaged based on 10-fold cross-validation.
        Classifier   Resampling?   Accuracy   AUC
        AdaBoost     No            92.6%      0.91
        AdaBoost     Yes           96.4%      0.99
        SVM          No            88.3%      0.77
        SVM          Yes           95.6%      0.95

      Table 3: Confusion matrices for a boosted decision stump classifier with and without resampling. The labels on the rows refer to true class assignments; the labels on the columns are those predicted.
                     No resampling             With resampling
        True class   Truthy      Legitimate    Truthy       Legitimate
        T            45 (12%)    16 (4%)       165 (45%)    6 (1%)
        L            11 (3%)     294 (80%)     7 (2%)       188 (51%)

      [...] additional volunteers), and asking them to place each meme in one of the three categories. A meme was to be classified as ‘truthy’ if a significant portion of the users involved in that meme appeared to be spreading it in misleading ways, e.g., if a number of the accounts tweeting about the meme appeared to be robots or sock puppets, the accounts appeared to follow only other propagators of the meme (clique behavior), or the users engaged in repeated reply/retweet exclusively with other users who had tweeted the meme. ‘Legitimate’ memes were described as memes representing normal use of Twitter: several non-automated users conversing about a topic. The final category, ‘remove,’ was used for memes in a non-English language or otherwise unrelated to U.S. politics (#youth, for example). These memes were not used in the training or evaluation of classifiers.
      Upon gathering 252 annotated memes, we observed an imbalance in our labeled data (231 legitimate and only 21 truthy). Rather than simply resampling from the smaller class, as is common practice in the case of class imbal[...]
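      As a quick consistency check, the AdaBoost accuracies in Table 2 can be recomputed from the confusion matrices in Table 3. The sketch below treats ‘truthy’ as the positive class. The precision and recall figures it also prints are derived here and are not reported on the slide.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall from a 2x2 confusion matrix.

    tp/fn come from the true-'truthy' row, fp/tn from the true-'legitimate' row.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts copied from Table 3 (rows: true class, columns: predicted class).
no_resampling = metrics(tp=45, fn=16, fp=11, tn=294)
with_resampling = metrics(tp=165, fn=6, fp=7, tn=188)

for label, m in (("no resampling", no_resampling),
                 ("with resampling", with_resampling)):
    print(label, {name: round(100 * value, 1) for name, value in m.items()})
```

      Both accuracies (92.6% and 96.4%) match the AdaBoost rows of Table 2. The recall values show that resampling mainly helps the classifier recover the rarer ‘truthy’ class.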
  78. 78. www.bgoncalves.com@bgoncalves Even more so in Political Elections http://truthy.indiana.edu
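      Several of the features listed in Table 1 of the truthy classification excerpt are simple graph statistics. Here is a minimal sketch of how a few of them could be computed from a weighted, directed edge list. The edge list is made up, the function name `diffusion_features` is hypothetical, and defining mean degree as edges per node is an assumption rather than the authors' definition.

```python
from collections import defaultdict
from statistics import pstdev


def diffusion_features(edges):
    """Compute a few Table 1 style features from (source, target, weight) edges."""
    k_in, k_out = defaultdict(int), defaultdict(int)
    s_in, s_out = defaultdict(float), defaultdict(float)
    nodes = set()
    for src, dst, weight in edges:
        nodes.update((src, dst))
        k_out[src] += 1
        k_in[dst] += 1
        s_out[src] += weight
        s_in[dst] += weight
    ordered = sorted(nodes)  # deterministic tie-breaking
    top_in = max(ordered, key=lambda u: k_in[u])
    top_out = max(ordered, key=lambda u: k_out[u])
    return {
        "nodes": len(ordered),
        "edges": len(edges),
        "mean_k": len(edges) / len(ordered),               # assumed definition
        "mean_s": sum(s_in.values()) / len(ordered),
        "mean_w": sum(w for _, _, w in edges) / len(edges),
        "max_k_in": k_in[top_in], "max_k_in_user": top_in,
        "max_k_out": k_out[top_out], "max_k_out_user": top_out,
        "std_k_in": pstdev(k_in[u] for u in ordered),
    }


# Hypothetical diffusion network: (who, reaches whom, number of interactions).
toy_edges = [("a", "b", 2), ("a", "c", 1), ("a", "d", 1), ("b", "c", 3)]
features = diffusion_features(toy_edges)
print(features)
```

      Skewness, connected-component sizes and the sentiment scores are left out to keep the sketch short.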
  79. 79. www.bgoncalves.com@bgoncalves Even more so in Political Elections http://truthy.indiana.edu Why not start with something a bit simpler?
  80. 80. www.bgoncalves.com@bgoncalves American Idol • Popularity contest • Well defined audience, across the entire US • Similar demographics voting and tweeting • Weekly “votes”, involving the same population • Immediate results • (Almost) No incentives for organized campaigns
  81. 81. www.bgoncalves.com@bgoncalves Calibration: Top 5 [Figure: % of tweets (0 to 70) per contestant: Jessica, Phillip, Joshua, Hollie, Skylar] EPJ Data Science 1, 8 (2012)
  82. 82. www.bgoncalves.com@bgoncalves Calibration: Top 4 [Figure: % of tweets (0 to 70) per contestant: Jessica, Phillip, Joshua, Hollie; Skylar marked separately] EPJ Data Science 1, 8 (2012)
  83. 83. www.bgoncalves.com@bgoncalves Calibration: Top 3 [Figure: % of tweets (0 to 70) per contestant: Jessica, Phillip, Joshua; Skylar marked separately] EPJ Data Science 1, 8 (2012)
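      The calibration slides plot each contestant's share of tweets. Here is a minimal sketch of how such shares could be computed, assuming simple case-insensitive name matching. The tweets below are invented, and treating the least-mentioned contestant as the predicted elimination is a simplifying assumption rather than the published method.

```python
from collections import Counter


def tweet_shares(tweets, contestants):
    """Percentage of contestant mentions per contestant (case-insensitive)."""
    counts = Counter()
    for text in tweets:
        lowered = text.lower()
        for name in contestants:
            if name.lower() in lowered:
                counts[name] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * counts[name] / total for name in contestants}


# Invented example tweets, for illustration only.
tweets = [
    "Go Jessica!",
    "Phillip was great tonight",
    "jessica all the way",
    "Voting for Joshua and Jessica",
    "Phillip again, no contest",
]
shares = tweet_shares(tweets, ["Jessica", "Phillip", "Joshua"])
predicted_elimination = min(shares, key=shares.get)
print(shares, predicted_elimination)
```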
  84. 84. www.bgoncalves.com@bgoncalves Geographic Locations [Figure: panels (A)-(C) for the Top 5, Top 4 and Top 3 rounds; legend: Jessica, Phillip, Joshua, Hollie, Skylar, CC] EPJ Data Science 1, 8 (2012)
  85. 85. www.bgoncalves.com@bgoncalves An actual prediction EPJ Data Science 1, 8 (2012)
  86. 86. www.bgoncalves.com@bgoncalves And the winner is... [Figure: % of tweets (0 to 80) for Jessica and Phillip, World and U.S.] EPJ Data Science 1, 8 (2012)
  87. 87. www.bgoncalves.com@bgoncalves And the winner is... [Figure: % of tweets (0 to 80) for Jessica and Phillip, U.S.] EPJ Data Science 1, 8 (2012)
  88. 88. www.bgoncalves.com@bgoncalves And the winner is... EPJ Data Science 1, 8 (2012)
  89. 89. www.bgoncalves.com@bgoncalves Stock Market
      [Figure: methodology diagram. Twitter feed -> text analysis -> daily mood indicators from (1) OpinionFinder and (2) G-POMS (6 dim.); DJIA -> daily stock market values -> normalization; Granger causality (lag -n, F-statistic, p-value); (3) SOFNN using values at t-1, t-2, t-3 to predict the value at t=0, evaluated by MAPE and Direction %. Data sets and timeline: Feb 28 2008 to Dec 20 2008, with phases (1) OF ~ GPOMS, (2) Granger causality analysis, (3) SOFNN training and test.]
      Fig. 1. Diagram outlining 3 phases of methodology and corresponding data sets: (1) creation and validation of OpinionFinder and GPOMS public mood [...]
  90. 90. www.bgoncalves.com@bgoncalves POMS
      • Simple questionnaire that classifies a person’s mood along 6 dimensions:
          • tension-anxiety
          • depression-dejection
          • anger-hostility
          • fatigue-inertia
          • vigor-activity
          • confusion-bewilderment
      • How to administer it to Twitter users?
          • Expand vocabulary using Google n-grams
          • Search Twitter for matching words

      Profile of Mood States
      Subject's Initials / Birth date / Date / Subject Code No.
      Directions: Describe HOW YOU FEEL RIGHT NOW by circling the most appropriate number after each of the words listed below.
      Scale: 1 = Not at all, 2 = A little, 3 = Moderate, 4 = Quite a bit, 5 = Extremely
        1. Friendly    2. Tense    3. Angry    4. Worn Out    5. Unhappy    6. Clear-headed
        7. Lively    8. Confused    9. Sorry for things done    10. Shaky    11. Listless    12. Peeved
        13. Considerate    14. Sad    15. Active    16. On edge    17. Grouchy    18. Blue
        19. Energetic    20. Panicky    21. Hopeless    22. Relaxed    23. Unworthy
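      The "search Twitter for matching words" step can be pictured as a lexicon lookup. Here is a minimal sketch, assuming a tiny hand-made lexicon that maps a few questionnaire adjectives to the six dimensions. The mapping and the example tweets are illustrative assumptions; the real tool uses a much larger vocabulary expanded with Google n-grams.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: adjective -> mood dimension (illustrative only).
LEXICON = {
    "tense": "tension-anxiety", "shaky": "tension-anxiety",
    "panicky": "tension-anxiety",
    "sad": "depression-dejection", "unhappy": "depression-dejection",
    "hopeless": "depression-dejection",
    "angry": "anger-hostility", "grouchy": "anger-hostility",
    "listless": "fatigue-inertia",
    "lively": "vigor-activity", "energetic": "vigor-activity",
    "confused": "confusion-bewilderment",
}


def mood_profile(tweets):
    """Fraction of lexicon matches falling into each mood dimension."""
    counts = Counter()
    for text in tweets:
        for token in re.findall(r"[a-z]+", text.lower()):
            dimension = LEXICON.get(token)
            if dimension is not None:
                counts[dimension] += 1
    total = sum(counts.values())
    return {dim: n / total for dim, n in counts.items()} if total else {}


# Invented example tweets.
profile = mood_profile([
    "So tense and shaky before the exam",
    "Feeling energetic today!",
    "Nothing to report",
])
print(profile)
```

      Computing one such profile per day gives a time series for each mood dimension.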
  91. 91. www.bgoncalves.com@bgoncalves Timelines along each mood dimension

      [...] counterpart to the differentiated response to the Presidential election. On Thanksgiving day we find a spike in Happy values, indicating high levels of public happiness. However, no other mood dimensions are elevated on November 27. Furthermore, the spike in Happy values is limited to the one day, i.e. we find no significant mood response the day before or after Thanksgiving.

      [Figure: z-score time series, Oct 22 to Nov 26, for OpinionFinder and the GPOMS dimensions CALM, ALERT, SURE, VITAL, KIND and HAPPY. Annotations: day after election and Thanksgiving (OpinionFinder), pre-election anxiety (CALM), election results (SURE), pre-election energy (VITAL), Thanksgiving happiness (HAPPY).]
      Fig. 2. Tracking public mood states from tweets posted between October 2008 to December 2008 shows public responses to presidential election and Thanksgiving.

      [...] partially overlap with the mood values provided by [...], but not necessarily all mood dimensions that [...] important in describing the various components of [...] e.g. the varied mood response to the Presidential [...] GPOMS thus provides a unique perspective on [...] states not captured by uni-dimensional tools such [...].

      Granger Causality Analysis of Mood vs. DJIA
      [...] establishing that our mood time series responds to [...] socio-cultural events such as the Presidential election [...] Thanksgiving, we are concerned with the question [...] variations of the public’s mood state correlate [...] in the stock market, in particular DJIA closing [...] answer this question, we apply the econometric [...] Granger causality analysis to the daily time [...] by GPOMS and OpinionFinder vs. the DJIA. [...] causality analysis rests on the assumption that if a [...] causes Y then changes in X will systematically [...] changes in Y. We will thus find that the lagged [...] will exhibit a statistically significant correlation [...] Correlation however does not prove causation. We [...] Granger causality analysis in a similar fashion [...] are not testing actual causation but whether one [...] has predictive information about the other or not. [...] time series, denoted Dt, is defined to reflect daily [...] stock market value, i.e. its values are the delta [...]

      [...] high level of confidence. However, this result only applies to 1 GPOMS mood dimension. We observe that X1 (i.e. Calm) has the highest Granger causality relation with DJIA for lags ranging from 2 to 6 days (p-values 0.05). The other four mood dimensions of GPOMS do not have significant causal relations with changes in the stock market, and neither does the OpinionFinder time series. To visualize the correlation between X1 and the DJIA in more detail, we plot both time series in Fig. 3. To maintain the same scale, we convert the DJIA delta values Dt and mood index value Xt to z-scores as shown in Eq. 1.

      [Figure: DJIA z-score and Calm z-score, Aug 09 to Oct 28, on a scale of -2 to 2. Annotation: bank bail-out.]
      Fig. 3. A panel of three graphs. The top graph shows the overlap of the day-to-day difference of DJIA values (blue: ZDt) with the GPOMS’ Calm [...]

      Look for correlations between dimensions and DJIA

      Twitter mood predicts the stock market.
      Johan Bollen (1,*), Huina Mao (1,*), Xiao-Jun Zeng (2). *: authors made equal contributions.
      arXiv:1010.3003v1 [cs.CE] 14 Oct 2010

      Abstract: Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e. can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public’s response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.
      Index Terms: stock market prediction, twitter, mood analysis.

      I. INTRODUCTION
      Stock market prediction has attracted much attention from academia as well as business. But can the stock market really be predicted? Early research on stock market prediction [1], [2], [3] was based on random walk theory and the Efficient Market Hypothesis (EMH) [4]. According to the EMH stock market prices are largely driven by new information, i.e. news, rather than present and past prices. Since news is unpredictable, stock market prices will follow a random walk pattern and cannot be predicted with more than 50 percent accuracy [5].
      There are two problems with EMH. First, numerous studies show that stock market prices do not follow a random walk and can indeed to some degree be predicted [5], [6], [7], [8], thereby calling into question EMH’s basic assumptions. Second, recent research suggests that news may be unpredictable but that very early indicators can be extracted from online social media (blogs, Twitter feeds, etc) to predict changes in various economic and commercial indicators. This may conceivably also be the case for the stock market. For example, [11] shows how online chat activity predicts book sales. [12] uses assessments of blog sentiment to predict movie sales. [...] sentiment from blogs. In addition, Google search queries have been shown to provide early indicators of disease infection rates and consumer spending [14]. [9] investigates the relations between breaking financial news and stock price changes. Most recently [13] provide a ground-breaking demonstration of how public sentiment related to movies, as expressed on Twitter, can actually predict box office receipts.
      Although news most certainly influences stock market prices, public mood states or sentiment may play an equally important role. We know from psychological research that emotions, in addition to information, play a significant role in human decision-making [16], [18], [39]. Behavioral finance has provided further proof that financial decisions are significantly driven by emotion and mood [19]. It is therefore reasonable to assume that the public mood and sentiment can drive stock market values as much as news. This is supported by recent research by [10] who extract an indicator of public anxiety from LiveJournal posts and investigate whether its variations can predict SP500 values.
      However, if it is our goal to study how public mood influences the stock markets, we need reliable, scalable and early assessments of the public mood at a time-scale and resolution appropriate for practical stock market prediction. Large surveys of public mood over representative samples of the population are generally expensive and time-consuming to conduct, cf. Gallup’s opinion polls and various consumer and well-being indices. Some have therefore proposed indirect assessment of public mood or sentiment from the results of soccer games [20] and from weather conditions [21]. The accuracy of these methods is however limited by the low degree to which the chosen indicators are expected to be correlated with public mood.
      Over the past 5 years significant progress has been made in sentiment tracking techniques that extract indicators of public mood directly from social media content such as blog content [10], [12], [15], [17] and in particular large-scale Twitter feeds [22]. Although each so-called tweet, i.e. an individual user post, is limited to only 140 characters, the aggregate of millions of tweets submitted to Twitter at any given time may provide an accurate representation of public mood and sentiment. This has led to the development of real-time sentiment-tracking indicators such as [17] and “Pulse of Nation”.
      In this paper we investigate whether public sentiment, as expressed in large-scale collections of daily Twitter posts, can be used to predict the stock market. We use two tools to measure variations in the public mood from tweets submitted [...]
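      The "look for correlations between dimensions and DJIA" step can be illustrated with a lagged correlation between a mood series and the day-to-day DJIA differences. The sketch below is a deliberately simplified stand-in, not a Granger causality test, which would compare regression models with and without the lagged mood values. Since Eq. 1 does not appear on the slide, a plain z-score over the whole series is assumed, and all numbers are made up.

```python
from statistics import mean, pstdev


def deltas(closing):
    """Day-to-day differences of a series of closing values."""
    return [today - yesterday for yesterday, today in zip(closing, closing[1:])]


def zscores(series):
    """Plain z-score over the whole series (an assumption, see above)."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t - lag] and y[t], for lag >= 0."""
    xs, ys = (x[:-lag], y[lag:]) if lag > 0 else (x, y)
    zx, zy = zscores(xs), zscores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


# Made-up series: the "market" series copies the "mood" series two days later.
mood = [1, 3, 2, 5, 4, 6, 5, 8]
market = [0, 0, 2, 6, 4, 10, 8, 12]

for lag in range(4):
    print(lag, round(lagged_correlation(mood, market, lag), 3))
```

      In this toy data the correlation reaches 1 at a lag of two days, which is the kind of lead-lag pattern the analysis looks for. A strong lagged correlation still says nothing about causation.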
  92. 92. www.bgoncalves.com@bgoncalves And it works!
  93. 93. www.bgoncalves.com@bgoncalves And it works! (Maybe!)
  94. 94. www.bgoncalves.com@bgoncalves And it works! (Maybe!)
  95. 95. www.bgoncalves.com@bgoncalves Coming Soon! CompleNet 2016 Dijon, France — March 23-25
  96. 96. www.bgoncalves.com@bgoncalves Coming Soon! CompleNet 2016 Dijon, France — March 23-25
