Who says what to whom on Twitter? - Twitter flow


Published on

Article on Nieman Journalism Lab: http://bit.ly/gVRSAc
Who says what to whom on Twitter?
Four scientists studied the production, flow, and consumption of information on Twitter.... they found a dramatic concentration of attention: 50 percent of tweets consumed are generated by just 20,000 elite users. Or, as the authors note later in the study: “Roughly 0.05% of the population accounts for almost half of all attention”. But there are two other key takeaways in the study worth highlighting, particularly for news organizations trying to figure out how their content meshes with the web’s social layer. First, there’s the concentration-of-distribution idea: the way users discover, and then consume, media content on Twitter. Second, there’s the longevity idea: the lifespan of that content as it enters our consciousness and then, often almost instantly, fades away.

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Who says what to whom on Twitter? - Twitter flow

  1. 1. Who Says What to Whom on Twitter Shaomei Wu∗ Jake M. Hofman Cornell University, USA Yahoo! Research, NY, USA sw475@cornell.edu hofman@yahoo-inc.com Winter A. Mason Duncan J. Watts Yahoo! Research, NY, USA Yahoo! Research, NY, USA winteram@yahoo- djw@yahoo-inc.com inc.comABSTRACT 1. INTRODUCTIONWe study several longstanding questions in media communi- A longstanding objective of media communications re-cations research, in the context of the microblogging service search is encapsulated by what is known as Lasswell’s maxim:Twitter, regarding the production, flow, and consumption “who says what to whom in what channel with what ef-of information. To do so, we exploit a recently introduced fect” [9], so-named for one of the pioneers of the field, Haroldfeature of Twitter—known as Twitter lists—to distinguish Lasswell. Although simple to state, Laswell’s maxim hasbetween elite users, by which we mean specifically celebri- proven difficult to satisfy in the more-than 60 years sinceties, bloggers, and representatives of media outlets and other he stated it, in part because it is generally difficult to ob-formal organizations, and ordinary users. Based on this clas- serve information flows in large populations, and in partsification, we find a striking concentration of attention on because different channels have very different attributes andTwitter—roughly 50% of tweets consumed are generated by effects. As a result, theories of communications have tendedjust 20K elite users—where the media produces the most in- to focus either on “mass” communication, defined as “one-formation, but celebrities are the most followed. We also find way message transmissions from one source to a large, rela-significant homophily within categories: celebrities listen to tively undifferentiated and anonymous audience,” or on “in-celebrities, while bloggers listen to bloggers etc; however, terpersonal” communication, meaning a “two-way messagebloggers in general rebroadcast more information than the exchange between two or more individuals.” [13].other categories. Next we re-examine the classical “two-step Correspondingly, debates among communication theoristsflow” theory of communications, finding considerable sup- have tended to revolve around the relative importance ofport for it on Twitter, but also some interesting differences. these two putative modes of communication. For exam-Third, we find that URLs broadcast by different categories ple, whereas early theories such as the so-called “hypodermicof users or containing different types of content exhibit sys- model” posited that mass media exerted direct and relativelytematically different lifespans. And finally, we examine the strong effects on public opinion, mid-century researchers [10,attention paid by the different user categories to different 6, 11, 4] argued that the mass media influenced the pub-news topics. lic only indirectly, via what they called a two-step flow of communications, where the critical intermediate layer wasCategories and Subject Descriptors occupied by a category of media-savvy individuals called opinion leaders. The resulting “limited effects” paradigmH.1.2 [Models and Principles]: User/Machine Systems; was then subsequently challenged by a new generation ofJ.4 [Social and Behavioral Sciences]: Sociology researchers [5], who claimed that the real importance of the mass media lay in its ability to set the agenda of publicGeneral Terms discourse. But in recent years rising public skepticism oftwo-step flow, communications, classification mass media, along with changes in media and communica- tion technology, have tilted conventional academic wisdom once more in favor of interpersonal communication, whichKeywords some identify as a “new era” of minimal effects [2].Communication networks, Twitter, information flow Recent changes in technology, however, have increasingly∗ Part of this research was performed while the author was undermined the validity of the mass vs. interpersonal di-visiting Yahoo! Research, New York. chotomy itself. On the one hand, over the past few decades mass communication has experienced a proliferation of new channels, including cable television, satellite radio, special- ist book and magazine publishers, and of course an array ofPermission to make digital or hard copies of all or part of this work for web-based media such as sponsored blogs, online communi-personal or classroom use is granted without fee provided that copies are ties, and social news sites. Correspondingly, the traditionalnot made or distributed for profit or commercial advantage and that copies mass audience once associated with, say, network televisionbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specific has fragmented into many smaller audiences, each of whichpermission and/or a fee. increasingly selects the information to which it is exposed,WWW ’11 Hyderabad, India and in some cases generates the information itself. Mean-Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.
  2. 2. while, in the opposite direction interpersonal communication deliver qualitatively similar results. In section 4 we analyzehas become increasingly amplified through personal blogs, the production of information on Twitter, particularly whoemail lists, and social networking sites to afford individu- pays attention to whom. In section 4.1, we revisit the the-als ever-larger audiences. Together, these two trends have ory of the two-step flow—arguably the dominant theory ofgreatly obscured the historical distinction between mass and communications for much of the past 50 years—finding con-interpersonal communications, leading some scholars to refer siderable support for the theory as well as some interestinginstead to “masspersonal” communications [13]. differences. In section 5, we consider “who listens to what”, Nowhere is the erosion of traditional categories more ap- examining first who shares what kinds of media content, andparent than in the micro-blogging platform Twitter. To il- second the lifespan of URLs as a function of their origin andlustrate, the top ten most-followed users on Twitter are not their content. Finally, in section 6 we conclude with a briefcorporations or media organizations, but individual people, discussion of future work.mostly celebrities. Moreover, these individuals communi-cate directly with their millions of followers, often managedby themselves or publicists, thus bypassing the traditional 2. RELATED WORKintermediation of the mass media between celebrities and Aside from the communications literature surveyed above,fans. Next, in addition to conventional celebrities, a new a number of recent papers have examined information dif-class of “semi-public” individuals like bloggers, authors, jour- fusion on Twitter. Kwak et al. [8] studied the topologicalnalists, and subject matter experts have come to occupy an features of the Twitter follower graph, concluding from theimportant niche on Twitter, in some cases becoming more highly skewed nature of the distribution of followers and theprominent than traditional public figures such as entertain- low rate of reciprocated ties that Twitter more closely resem-ers and elected officials. Third, in spite of these shifts away bled an information sharing network than a social network—from centralized media power, media organizations—along a conclusion that is consistent with our own view. In ad-with corporations, governments, and NGOs—all remain well dition, Kwak et al. compared three different measures ofrepresented among highly followed users, and are often ex- influence—number of followers, page-rank, and number oftremely active. And finally, Twitter is primarily made up retweets—finding that the ranking of the most influentialof many millions of users who seem to be ordinary individ- users differed depending on the measure. In a similar vein,uals communicating with their friends and acquaintances in Cha et al. [3] compared three measures of influence—numbera manner largely consistent with traditional notions of in- of followers, number of retweets, and number of mentions—terpersonal communication. and also found that the most followed users did not neces- Twitter, therefore, represents the full spectrum of commu- sarily score highest on the other measures. Weng et al. [15]nications from personal and private to “masspersonal” to tra- compared number of followers and page rank with a modifiedditional mass media. Consequently it provides an interesting page-rank measure which accounted for topic, again findingcontext in which to address Lasswell’s maxim, especially as that ranking depended on the influence measure. Finally,Twitter—unlike television, radio, and print media—enables Bakshy et al. [1] studied the distribution of retweet cascadesone to easily observe information flows among the members on Twitter, finding that although users with large followerof its ecosystem. Unfortunately, however, the kinds of ef- counts and past success in triggering cascades were on aver-fects that are of most interest to communications theorists, age more likely to trigger large cascades in the future, thesesuch as changes in behavior, attitudes, etc., remain difficult features are in general poor predictors of future cascade size.to measure on Twitter. Therefore in this paper we limit Our paper differs from this earlier work by shifting atten-our focus to the “who says what to whom” part of Laswell’s tion from the ranking of individual users in terms of variousmaxim. influence measures to the flow of information among dif- To this end, our paper makes three main contributions: ferent categories of users. In particular, we are interested in identifying “elite” users, who we differentiate from “ordi- • We introduce a method for classifying users using Twit- nary” users in terms of their visibility, and understanding ter Lists into “elite” and “ordinary” users, further clas- their role in introducing information into Twitter, as well as sifying elite users into one of four categories of interest— how information originating from traditional media sources media, celebrities, organizations, and bloggers. reaches the masses. • We investigate the flow of information among these categories, finding that although audience attention is 3. DATA AND METHODS highly concentrated on a minority of elite users, much of the information they produce reaches the masses 3.1 Twitter Follower Graph indirectly via a large population of intermediaries. In order to understand how information is transmitted on • We find that different categories of users place slightly Twitter, we need to know the channels by which it flows; different emphasis on different types of content, and that is, who is following whom on Twitter. To this end, we that different content types exhibit dramatically dif- used the follower graph studied by Kwak et al. [8], which ferent characteristic lifespans, ranging from less than included 42M users and 1.5B edges. This data represents a day to months. a crawl of the graph seeded with all users on Twitter as observed by July 31st, 2009, and is publicly available1 . As The remainder of the paper proceeds as follows. In the reported by Kwak et al. [8], the follower graph is a directednext section, we review related work. In section 3 we dis- network characterized by highly skewed distributions bothcuss our data and methods, including section 3.3 in whichwe describe how we use Twitter Lists to classify users, out- 1 The data is free to download fromline two different sampling methods, and show that they http://an.kaist.ac.kr/traces/WWW2010.html
  3. 3. of in-degree (# followers) and out-degree (# “friends”, Twit- are interested in the relative importance of mass commu-ter notation for how many others a user follows); however, nications, as practiced by media and other formal organiza-the out-degree distribution is even more skewed than the tions, masspersonal communications as practiced by celebri-in-degree distribution. In both friend and follower distribu- ties and prominent bloggers, and interpersonal communica-tions, for example, the median is less than 100, but the max- tions, as practiced by ordinary individuals communicatingimum # friends is several hundred thousand, while a small with their friends. In addition, we are also interested in thenumber of users have millions of followers. In addition, the relationships between these categories of users, motivatedfollower graph is also characterized by extremely low reci- by theoretical arguments such as the theory of the two-stepprocity (roughly 20%)—in particular, the most-followed in- flow [6]. Rather than pursuing a strategy of automatic clas-dividuals typically do not follow many others. The Twitter sification, therefore, our approach depends on defining andfollower graph, in other words, does not conform to the usual identifying certain predetermined classes of theoretical in-characteristics of social networks, which exhibit much higher terest, where both approaches have advantages and disad-reciprocity and far less skewed degree distributions [7], but vantages. In particular, we restrict our attention to fourinstead resembles more the mixture of one-way mass com- classes of what we call “elite” users: media, celebrities, orga-munications and reciprocated interpersonal communications nizations, and bloggers, as well as the relationships betweendescribed above. these elite users and the much larger population of “ordi- nary” users.3.2 Twitter Firehose In additional to these theoretically-imposed constraints, In addition to the follower graph, we are interested in the our proposed classification method must also satisfy a prac-content being shared on Twitter—particularly URLs—and tical constraint—namely that the rate limits established byso we examined the corpus of all 5B tweets generated over Twitter’s API effectively preclude crawling all lists for alla 223 day period from July 28, 2009 to March 8, 2010 us- Twitter users3 . Thus we instead devised two different sam-ing data from the Twitter “firehose,” the complete stream pling schemes—a snowball sample and an activity sample—of all tweets2 . Because our objective is to understand the each with some advantages and disadvantages, discussed be-flow of information, it is useful for us to restrict attention to low.tweets containing URLs, for two reasons. First, URLs addeasily identifiable tags to individual tweets, allowing us to 3.3.1 Snowball sample of Twitter Listsobserve when a particular piece of content is either retweeted The first method for identifying elite users employed snow-or subsequently reintroduced by another user. And second, ball sampling. For each category, we chose a number u0 ofbecause URLs point to online content outside of Twitter, seed users that were highly representative of the desired cat-they provide a much richer source of variation than is pos- egory and appeared on many category-related lists. For eachsible in the typical 140 character tweet. Finally, we note of the four categories above, the following seeds were chosen:that almost all URLs broadcast on Twitter have been short-ened using one of a number of URL shorteners, of which the • Celebrities: Barack Obama, Lady Gaga, Paris Hiltonmost popular is http://bit.ly/. From the total of 5B tweets • Media: CNN, New York Timesrecorded during our observation period, therefore, we focusour attention on the subset of 260M containing bit.ly URLs. • Organizations: Amnesty International, World Wildlife Foundation, Yahoo! Inc., Whole Foods3.3 Twitter Lists Our method for classifying users exploits a relatively re- • Blogs4 : BoingBoing, FamousBloggers, problogger, mash-cent feature of Twitter: Twitter Lists. Since its launch on able. Chrisbrogan, virtuosoblogger, Gizmodo, Ileane,November 2, 2009, Twitter Lists have been welcomed by the dragonblogger, bbrian017, hishaman, copyblogger, en-community as a way to group people and organize one’s in- gadget, danielscocco, BlazingMinds, bloggersblog, Ty-coming stream of tweets by specific sets of users. To create coonBlogger, shoemoney, wchingya, extremejohn,a Twitter List, a user needs to provide a name (required) GrowMap, kikolani, smartbloggerz, Element321, bran-and description (optional) for the list, and decide whether donacox, remarkablogger, jsinkeywest, seosmarty, No-the new list is public (anyone can view and subscribe to this tAProBlog, kbloemendaal, JimiJones, ditescolist) or private (only the list creator can view or subscribe tothis list). Once a list is created, the user can add/edit/delete After reviewing the lists associated with these seeds, thelist members. As the purpose of Twitter Lists is to help users following keywords were hand-selected based on (a) theirorganize users they follow, the name of the list can be con- representativeness of the desired categories; and (b) theirsidered a meaningful label for the listed users. List creation lack of overlap between categories:therefore effectively exploits the “wisdom of crowds” [12] 3 The Twitter API allows only 20K calls per hour, where atto the task of classifying users, both in terms of their im- most 20 lists can be retrieved for each API call. Under theportance to the community (number of lists on which they modest assumption of 40M users (roughly the number in theappear), and also how they are perceived (e.g. news organi- 2009 crawl by [8]), where each user is included on at mostzation vs. celebrity, etc.). 20 lists, this would require 4 ∗ 106 /2 ∗ 103 = 2, 000 hours, or Before describing our methods for classifying users in terms 11 weeks. Clearly this time could be reduced by deployingof the lists on which they appear, we emphasize that we multiple accounts, but it also likely underestimates the real time quite significantly, as many users appear on many moreare motivated by a particular set of substantive questions than 20 lists (e.g. Lady Gaga appears on nearly 140,000)arising out of communications theory. In particular, we 4 The blogger category required many more seeds because bloggers are in general lower profile than the seeds for the2 http://dev.twitter.com/doc/get/statuses/firehose other categories
  4. 4. u0 Table 1: Distribution of users over categories l0 Snowball Sample Activity Sample category # of users % of users # of users % of users u1 celeb 82,770 15.8% 14,778 13.0% media 216,010 41.2% 40,186 35.3% l1 org 97,853 18.7% 14,891 13.1% blog 127,483 24.3% 43,830 38.6% u2 total 524,116 100% 113,685 100% l2 however, the bias is likely to be quite different from any in-Figure 1: Schematic of the Snowball Sampling troduced by the snowball sample; thus obtaining similar re-Method sults from the two samples should give us confidence that our findings are not artifacts of the sampling procedure. This • Celebrities: star, stars, hollywood, celebs, celebrity, method initially yielded 750k users and 5M lists; however, celebrities, celebsverified, celebrity-list,celebrities-on- after pruning the lists to those that contained at least of the twitter, celebrity-tweets keywords above, and assigning users to unique categories (as described above), we obtained a much-reduced sample • Media: news, media, news-media of 113,685 users, where Table 1 reports the number of users assigned to each category. We note that the number of lists • Organizations: company, companies, organization, obtained by the activity sampling methods is considerably organisation, organizations, organisations, corporation, smaller than that obtained by the snowball sample, and brands, products, charity, charities, causes, cause, ngo that bloggers are more heavily represented among the ac- tivity sample at the expense of the other three categories— • Blogs: blog, blogs, blogger, bloggers consistent with our claim that the two methods introduce different biases. Interestingly, however, 97,614 of the ac- Having selected the seeds and the keywords for each cate- tivity sample, or 85%, also appear in the snowball sample,gory, we then performed a snowball sample of the bipartite suggesting that the two sampling methods identifying sim-graph of users and lists (see Figure 1). For each seed, we ilar populations of elite users–as indeed we confirm in thecrawled all lists on which that seed appeared. The resulting next section.“list of lists” was then pruned to contain only the l0 listswhose names matched at least one of the chosen keywordsfor that category. For instance, Lady Gaga is on lists called 3.3.3 Classifying Elite Users“faves”, “celebs”, and “celebrity”, but only the latter two lists In order to identify categories of elite users, we not onlywould be kept after pruning. We then crawled all u1 users need to classify users into categories, but also arrive at a def-appearing in the pruned “list of lists” (for instance, find- inition of “elite” that satisfies a tradeoff between (a) keepinging all users that appeared in the “celebrity” list with Lady each category relatively small, so as not to include users whoGaga), and then repeated these last two steps to complete are not distinguishable from ordinary users, while (b) maxi-the crawl. In total, 524, 116 users were obtained, who ap- mizing the volume of attention that is accounted for by eachpeared on 7, 000, 000 lists; however, many of the more promi- category. In addition, it is also desirable to make the fournent users appeared on lists in more than one category—for categories the same size, so as to facilitate comparisons. Toexample Oprah Winfrey is frequently included in lists of this end, we first rank all users in each of category by how“celebrity” as well as “media.” To resolve this ambiguity, we frequently they are listed in that category. Next, we mea-computed a user i’s membership score in category c: sure the flow of information from the top k users in each nic of the four categories to a random sample of 100K ordinary wic = , (i.e. unclassified) users in two ways: the proportion of peo- Nc ple the user follows in each category, and the proportion ofwhere nic is the number of lists in category c that contain tweets the user received from everyone the user follows inuser i and Nc is the total number of lists in category c. each category.We then assigned each user to the category in which he Figure 2(a) shows for the snowball sample the share ofor she has the highest membership score. The number of following links (square symbols) and tweets received (dia-users assigned in this manner to each category is reported monds) by an average user, while Figure 2(b) shows thein Table 1. same information for the activity sample. Although the nu- merical values differ slightly, the two sets of results are qual- 3.3.2 Activity Sample of Twitter Lists itatively similar. In particular, for both sampling methods, Although the snowball sampling method is convenient and celebrities outrank all other categories, followed by the me-is easily interpretable with respect to our theoretical moti- dia, organizations, and bloggers. Also in both cases, thevation, it is also potentially biased by our particular choice bulk of the attention is accounted for by a relatively smallof seeds. To address this concern, we also generate a sample number of users within each category, as evidenced by theof users based on their activity. Specifically, we crawl all relatively flat slope of the attention curves, where we notelists associated with all users who tweet at least once every that the curve for celebrities asymptotes more slowly thanweek for our entire observation period. for the other three categories. Balancing the requirements This “activity-based” sampling method is also clearly bi- described above, therefore, we chose k = 5000 as a cut-offased towards users who are consistently active. Importantly, for the elite categories, where all remaining users are hence-
  5. 5. forth classified as ordinary. In addition, from this point on,we restrict our analysis to elite categories to the top 5,000 Table 2: # of URLs initiated by categoryusers identified by the sampling method, noting that both # of URLsmethods generate similar results. category # of URLs per-capita celeb 139,058 27.81 celebrities media media 5,119,739 1023.94 friends org 523,698 104.74 10 20 30 10 20 30 tweets received blog 1,360,131 272.03 average % average % ordinary 244,228,364 6.10 friends tweets received 0 0 1000 4000 7000 10000 1000 4000 7000 10000 dle for actor Ashton Kusher, one of the first celebrities to top k top k embrace Twitter and still one of the most followed users, while the remain celebrity users—Lady Gaga, Ellen De- organizations blogs generes, Oprah Winfrey, and Taylor Swift, are all household friends friends 10 20 30 10 20 30 tweets received tweets received names. In the media category, CNN Breaking News and the average % average % New York Times are most prominent, followed by Breaking News, Time, and Asahi, a leading Japanese daily newspa- per. Among organizations, Google, Starbucks, and Twit- ter are obviously large and socially prominent corporations, 0 0 1000 4000 7000 10000 1000 4000 7000 10000 top k top k while JoinRed is the charity organization started by Bono of U2, and ollehkt is the Twitter account for KT, formerly Ko- (a) Snowball sample rean Telecom. Finally, among the blogging category, Mash- able and ProBlogger are both prominent US blogging sites, celebrities media while Kibe Loco and Nao Salvo are popular blogs in Brazil, friends 10 20 30 10 20 30 tweets received and dooce is the blog of Heather Armstrong, a widely read average % average % “mommy blogger” with over 1.5M followers. friends tweets received Table 3: Top 5 users in each category 0 0 1000 4000 7000 10000 1000 4000 7000 10000 top k top k Celebrity Media Org Blog aplusk cnnbrk google mashable organizations blogs ladygaga nytimes Starbucks problogger friends friends TheEllenShow asahi twitter kibeloco 10 20 30 10 20 30 tweets received tweets received average % average % taylorswift13 BreakingNews joinred naosalvo Oprah TIME ollehkt dooce 0 0 1000 4000 7000 10000 1000 4000 7000 10000 top k top k 4. “WHO LISTENS TO WHOM” The results of the previous section provide qualified sup- (b) Activity sample port for the conventional wisdom that audiences have be- come increasingly fragmented. Clearly, ordinary users onFigure 2: Average fraction of # following (blue line) Twitter are receiving their information from many thou-and # tweets (red line) for a random user that are sands of distinct sources, most of which are not traditionalaccounted for by the top K elites users crawled media organizations—even though media outlets are by far the most active users on Twitter, only about 15% of tweets Based on this definition of elite users, Table 2 shows that received by ordinary users are received directly from thealthough ordinary users collectively introduce by far the media. Equally interesting, however, is that in spite of thishighest number of URLs, members of the elite categories are fragmentation, it remains the case that 20K elite users, com-far more active on a per-capita basis. In particular, users prising less than 0.05% of the user population, attracts al-classified as “media” easily outproduce all other categories, most 50% of all attention within Twitter. Even if the mediafollowed by bloggers, organizations, and celebrities. Ordi- has lost attention relative to other elites, information flowsnary users originate on average only about 6 URLs each, have not become egalitarian by any means.compared with over 1,000 for media users. In the rest of The prominence of elite users also raises the question ofthis paper, therefore, when we talk about “celebrity”, “me- how these different categories listen to each other. To ad-dia”, “organization”, “blog”, we refer the top 5K users drawn dress this issue, we compute the volume of tweets exchangedfrom the snowball sample listed as “celebrity”, “media”, “or- between elite categories. Specifically, Figure 3 shows theganization”, “blog”, respectively. average percentage of tweets that category i receives from Table 3, which shows the top 5 users in each of the four category j, exhibiting striking homophily with respect tocategories, suggests that the sampling method yields re- attention: celebrities overwhelmingly pay attention to othersults that are consistent with our objective of identifying celebrities, media actors pay attention to other media ac-users who are prominent exemplars of our target categories. tors, and so on. The one slight exception to this rule is thatAmong the celebrity list, for example, “aplusk,” is the han- organizations pay more attention to bloggers than to them-
  6. 6. light on the theory of the two-step flow, arguably the theory that has most successfully captured the dueling importance of mass media and interpersonal influence. The essence of the two-step flow is that information passes from the media Celeb Media to the masses not directly, as supposed by early theories of mass communication, but passes first through an intermedi- Category of Twitter Users ate layer of “opinion leaders” who decide which information A B to rebroadcast to their followers, and which to ignore. As B receive tweets from A we have already noted, on Twitter the flow of information to the masses from the media accounts for only a fraction Org Blog of the total volume of information. Nevertheless, it is still a substantial fraction, so it is still interesting to ask: for the special case of information originating from media sources, what proportion is broadcast directly to the masses, and what proportion is transmitted indirectly via some popula- tion of intermediaries? In addition, we may inquire whetherFigure 3: Share of tweets received among elite cat- these intermediaries, to the extent they exist, are drawnegories from other elite categories or from ordinary users, as claimed by the two-step flow theory; and if the latter, in what re- spects they differ from other ordinary users. Before proceeding with this analysis, we note that there are two ways information can pass through an intermedi- ary in Twitter. The first is via retweeting, which occurs when a users explicitly rebroadcasts a URL that he or she has received, along with an explicit acknowledgement of the source—either using official retweet function provided by Twitter, or making use of an informal convention such as “RT @user” or “via @user.” The second mechanism is what we label reintroduction, where a user subsequently tweets a URL that has previously been introduced by another user, but without the acknowledgment, in which case we assume the information has been rediscovered independently. For the purposes of studying when a user receives information directly from the media or indirectly through an intermedi- ary, we treat retweets and reintroductions equivalently. If Figure 4: RT behavior among elite categories the first occurrence of a URL in Twitter came from a media user, but a user received the URL from another source, then that source can be considered an intermediary, whether theyselves. In general, in fact, attention paid by organizations is are citing the source within Twitter by retweeting the URL,more evenly distributed across categories than for any other or reintroducing it, having discovered the URL outside ofcategory. Twitter. Figure 3, it should be noted, shows only how many URLs To quantify the extent to which ordinary users get theirare received by category i from category j, a particular weak information indirectly versus directly from the media, wemeasure of attention for the simple reason that many tweets sampled 1M random ordinary users5 , and for each user,go unread. A stronger measure of attention, therefore, is counted the number n of bit.ly URLs they had received thatto consider instead only those URLs introduced by category had originated from one of our 5K media users, where ofi that are subsequently retweeted by category j. Figure 4 the 1M total, 600K had received at least one such URL.shows how much information originating from each category For each member of this 600K subset we then counted theis retweeted by other categories. As with our previous mea- number n2 of these URLs that they received via non-mediasure of attention, retweeting is strongly homophilous among friends; that is, via a two-step flow. The average fractionelite categories; however, bloggers are disproportionately re- n2 /n = 0.46 therefore represents the proportion of media-sponsible for retweeting URLs originated by all categories. originated content that reaches the masses via an interme-This result reflects the characterization of bloggers as recy- diary rather than directly. As Figure 5 shows, however,clers and filters of information. However, even though on a this average is somewhat misleading. In reality, the pop-per-capita basis bloggers disproportionately occupy the role ulation comprises two types—those who receive essentiallyof information recyclers—93 retweets per person, compared all of their media-originating information via two-step flowsto only 1.1 retweets per person for ordinary users—the total and those who receive virtually all of it directly from thenumber of URLs retweeted by bloggers (465k) is vastly out- media. Unsurprisingly, the former type is exposed to lessweighed by the number retweeted by ordinary users (46M); total media than the latter. What is surprising, however, isthus their overall impact is relatively minimal. that even users who received up to 100 media URLs dur- 5 As before, performing this analysis for the entire population4.1 Two-Step Flow of Information of over 40M ordinary users proved to be computationally Examining information flow on Twitter can also shed new unfeasible.
  7. 7. a b 10000 # of opinion leaders 100 c d 1 0 4 16 64 256 2048 16384 131072 # of two−step recipientsFigure 5: Percentage of information that is received Figure 6: Frequency of intermediaries binned by #via an intermediary as a function of total volume of randomly sampled users to whom they transmit me-media content to which a user is exposed. dia content.ing our observation period received all of them from opinion communications technology in the interim—given, in fact,leaders. that a service like Twitter was likely unimaginable at the Who are these intermediaries, and how many of them are time—it is remarkable how well the theory agrees with ourthere? In total, the population of intermediaries is smaller observations.than that of the users who rely on them, but still surprisinglylarge, roughly 500K, the vast majority of which (96%) are 5. WHO LISTENS TO WHAT?classified ordinary users, not elites. Interestingly, Figure 5c The results in section 4 demonstrate the “elite” users ac-also shows that at least some intermediaries also receive the count for a substantial portion of all of the attention onbulk of their media content indirectly, just like other ordi- Twitter, but also show clear differences in how the attentionnary users. Comparing Figure 5a and 5c, however, we note is allocated to the different elite categories. It is thereforethat intermediaries are not like other ordinary users in that interesting to consider what kinds of content is being sharedthey are exposed to considerably more media than randomly by these categories. Given the large number of URLs in ourselected users, hence the number of intermediaries who rely observation period (260M ), and the many different ways oneon two-step flows is much smaller than for random users. In can classify content (video vs. text, news vs. entertainment,addition, we find that on average intermediaries have more political news vs. sports news, etc.), classifying even a smallfollowers than randomly sampled users (543 followers versus fraction of URLs according to content is an onerous task.34) and are also more active (180 tweets on average, versus Bakshy et al. [1], for example, used Amazon’s Mechanical7). Finally, Figure 6 shows that although all intermediaries, Turk to classify a stratified sample of 1,000 URLs along aby definition, pass along media content to at least one other variety of dimensions; however, this method does not scaleuser, a minority satisfies this function for multiple users, well to larger sample sizes.where we note that the most prominent intermediaries are Instead, we restrict attention to URLs originated by thedisproportionately drawn from the 4% elite users—Ashton New York Times which, with over 2.5M followers, is theKucher (asplusk), for example acts as an intermediary for second-most-followed news organization on Twitter, afterover 100,000 users. CNN Breaking News. NY Times, however, is roughly ten Interestingly, these results are all broadly consistent with times as active as CNN Breaking News, so it is arguable athe original conception of the two-step flow, advanced over better source of data. To classify NY Times content, we50 years ago, which emphasized that opinion leaders were exploit a convenient feature of their format—namely that“distributed in all occupational groups, and on every social all NY Times URLs are classified in a consistent way byand economic level,” corresponding to our classification of the section in which they appear (e.g. U.S., World, Sports,most intermediaries as ordinary. [6]. The original theory Science, Arts, etc.) 6 . Of the 6398 New York Times bit.lyalso emphasized that opinion leaders, like their followers, URLs we observed, 6370 could be successfully unshortenedalso received at least some of their information via two-step and assigned to one of 21 categories. Of these, however, onlyflows, but that in general they were more exposed to the 9 categories had more than 100 URLs during the observa-media than their followers—just as we find here. Finally, tion period, one of which—“NY region”—was highly specificthe theory predicted that opinion leadership was not a bi- to the New York metropolitan area; thus we focused ournary attribute, but rather a continuously varying one, cor- attention on the remaining 8 topical categories. Figure 7responding to our finding that intermediaries vary widely in shows the proportion of URLs from each New York Timesthe number of users for whom they act as filters and trans- section retweeted or reintroduced by each category. Worldmitters of media content. Given the length of time that haselapsed since the theory of the two-step flow was articulated, 6 http://www.nytimes.com/year/month/day/category/and the transformational changes that have taken place in title.html?ref=category
  8. 8. first observation last observation 1. World News 2. U.S. News of URL of URL 0.35 δ δ 0.30 0.25 0.20 0.15 0.10 estimation period = 133 days evaluation period = 90 days 0.05 0.00 δ 3. Business 4. Sports 0.35 Total observation window = 223 days 0.30 0.25 0.20 Figure 8: Schematic of lifespan estimation proce- % RTs and Re-introductions 0.15 0.10 dure 0.05 0.00 5. Health 6. Technology 0.35 create large biases for long lived URLs. In particular, URLs 0.30 that appear towards the end of our observation period will 0.25 0.20 be systematically classified as shorter-lived than URLs that 0.15 appear towards the beginning. 0.10 To address the censoring problem, we seek to determine 0.05 0.00 a buffer δ at both the beginning and the end of our 223- 7. Science 8. Arts day period, and only count URLs as having a lifespan of τ 0.35 if (a) they do not appear in the first δ days, (b) they first 0.30 0.25 appear in the interval between the buffers, and (c) they do 0.20 not appear in the last δ days, as illustrated in Figure 8(a). 0.15 To determine δ we first split the 223 day period into two 0.10 0.05 segments—the first 133 day estimation period and the last 0.00 90 day evaluation period (see Figure 8(b))—and then ask: if blog celeb media org other blog celeb media org other we (a) observe a URL first appear in the first 133−δ days and User Category (b) do not see it in the δ days prior to the splitting point, how likely are we see it in the last 90 days? Clearly this dependsFigure 7: Number of RTs and Reintroductions of on the actual lifespan of the URL, as the longer a URLNew York Times stories by content category lives, the more likely it will re-appear in the future. Using this estimation/evaluation split, we find an upper-bound on lifespan for which we can determine the actual lifespan withnews is the most popular category, followed by U.S. News, 95% accuracy as a function of δ. Finally, because we requireBusiness, and Sports, where increasingly niche categories a beginning and ending buffer, and because we can onlylike Health, Arts, Science, and Technology are less popu- classify a URL as having lifespan τ if it appears at least τlar still. In general, the overall pattern is replicated for all days before the end of our window, we need to pick τ andcategories of users, but there are some minor deviations: in δ such that τ + 2δ ≤ 223. We determined that τ = 70particular, organizations show disproportionately little in- and δ = 70 sufficiently satisfied our constraints; thus forterest in business and arts-related stories, and dispropor- the following analysis, we consider only URLs that have ationately high interest in science, technology, and possibly lifespan τ ≤ 70.world news. Celebrities, by contrast, show greater interestin sports and less interest in health, while the media shows 5.2 Lifespan By Categorysomewhat greater interest in U.S. news stories. Having established a method for estimating URL lifes- pan, we now explore the lifespan of URLs introduced by5.1 Lifespan of Content different categories of users, as shown in Figure 9(a). URLs In addition to different types of content, URLs introduced initiated by the elite categories exhibit a similar distribu-by different types of elite users or ordinary users may exhibit tion over lifespan to those initiated by ordinary users. Asdifferent lifespans, by which we mean the time lag between Figure 9(b) shows, however, when looking at the percent-the first and last appearance of a given URL on Twitter. age of URLs of different lifespans initiated by each category, Naively, measuring lifespan seems a trivial matter; how- we see two additional results: first, URLs originated by me-ever, a finite observation period—which results in censoring dia actors generate a large portion of short-lived URLs (es-of our data—complicates this task. In other words, a URL pecially URLs with lifespan=0, those that only appearedthat is last observed towards the end of the observation pe- once); and second, URLs originated by bloggers are over-riod may be retweeted or reintroduced after the period ends, represented among the longer-lived content. Both of thesewhile correspondingly, a URL that is first observed toward results can be explained by the type of content that origi-the beginning of the observation window may in fact have nates from different sources: whereas news stories tend tobeen introduced before the window began. What we ob- be replaced by updates on a daily or more frequent basis,serve as the lifespan of a URL, therefore, is in reality a the sorts of URLs that are picked up by bloggers are of morelower bound on the lifespan. Although this limitation does persistent interest, and so are more likely to be retweeted ornot create much of a problem for short-lived URLs—which reintroduced months or even years after their initial intro-account for the vast majority of our observations—it does duction.
  9. 9. 20 log(# of URLs with lifespan = x day) q other q celeb q media 15 org q blog q q qq q q q qq q qq qq qq qqqqqqq 10 qqqq qqq qqqqq qqqqq qqq qqqqq qqqqqqq qqq qqqq qqq qqq 5 0 0 10 20 30 40 50 60 70 Figure 10: Top 20 domains for URLs that lived more lifespan (day) than 200 days (a) Count RT rate (# of RTs / total # of occurrences) 1.0 q other 7 celeb celeb media media 0.8 org % of URLs from elites category 6 org blog blog 5 0.6 4 0.4 3 q q 0.2 q qq q qq 2 qqqq q q q qqqqqqqq qqqqqqqqqqqqqqq qqq qqq qqq qqq qqq qqq q q qq qqqqqq q q qq 1 0.0 q 0 10 20 30 40 50 60 70 0 lifespan (day) 0 10 20 30 40 50 60 70 lifespan (day) (b) Percent Figure 11: Average RT rate by lifespan for each of the originating categoriesFigure 9: 9(a) Count and 9(b) percentage of URLsinitiated by 5 categories, with different lifespans of appearances of URLs after the initial introduction derives not from retweeting, but rather from reintroduction, where this result is especially pronounced for long-lived URLs. To shed more light on the nature of long-lived content on For the vast majority of URLs on Twitter, in other words,Twitter, we used the bit.ly API service to unshorten 35K longevity is determined not by diffusion, but by many dif-of the most long-lived URLs (URLs that lived at least 200 ferent users independently rediscovering the same content,days), and mapped them into 21034 web domains. As Figure consistent with our interpretation above. Second, however,10 shows, the population of long-lived URLs is dominated by for URLs introduced by elite users, the result is somewhatvideos, music, and books. Twitter is, in other words, should the opposite—that is, they are more likely to be retweetedbe viewed as a subset of a much larger media ecosystem in than reintroduced, even for URLs that persist for weeks.which content exists and is repeatedly rediscovered by Twit- Although it is unsurprising that elite users generate moreter users. Some of this content—such as daily news stories— retweets than ordinary users, the size of the difference ishas a relatively short period of relevance, after which a given nevertheless striking, and suggests that in spite of the dom-story is unlikely to be reintroduced or rebroadcast. At the inant result above that content lifespan is determined to aother extreme, classic music videos, movie clips, and long- large extent by type, the source of its origin also impacts itsformat magazine articles have lifespans that are effectively persistence, at least on average—a result that is consistentunbounded, and can seemingly be rediscovered by Twitter with previous findings [1].users indefinitely without losing relevance. Two related points are illustrated by Figure 11, whichshows the average RT rate (the proportion of tweets con- 6. CONCLUSIONStaining the URL that are retweets of another tweet) of URLs In this paper, we investigated a classic problem in me-with different lifespans, grouped by the categories that in- dia communications research, captured by the first part oftroduced the URL7 . First, for ordinary users, the majority Laswell’s maxim—“who says what to whom”—in the context that only appeared once in our dataset, thus the RT rate is7 Note here that URLs with lifespan = 0 are those URLs zero.
  10. 10. of Twitter. By restricting our attention to Twitter, our con- 7. REFERENCESclusions are necessarily limited to one narrow cross-section of [1] E. Bakshy, J. M. Hofman, A. Mason, Winter, andthe media landscape. Moreover, communications on Twitter D. J. Watts. Identifying ‘influencers’ on twitter. Inmay be unrepresentative of information flow via more tradi- Fourth ACM International Conference on Web Seachtional channels, such as TV and radio on the one hand, and and Data Mining (WSDM), Hong Kong, 2011. ACM.interpersonal interactions on the other hand. However, we [2] W. L. Bennett and S. Iyengar. A new era of minimalfeel the advantages of using Twitter to answer this question effects? the changing foundations of politicaloutweighed the limitations. First, because Twitter users ex- communication. Journal of Communication,plicitly opt-in to “follow” each other, and because Twitter 58(4):707–731, 2008.maintains a complete record of every tweet broadcast, it [3] M. Cha, H. Haddadi, F. Benevenuto, and K. P.provides an unprecedented level of resolution and coverage Gummad. Measuring user influence on twitter: Theregarding who is listening to whom. Second, because Twit- million follower fallacy. In 4th Int’l AAAI Conferenceter users themselves classify other users by including them on Weblogs and Social Media, Washington, DC, 2010.on lists, Twitter effectively provides a ready-made, crowd- [4] J. S. Coleman, E. Katz, and H. Menzel. The diffusionsourced classification scheme of users. of an innovation among physicians. Sociometry, By studying the flow of information among the five cat- 20(4):253–270, 1957.egories that we identified (media, celebrities, organizations,bloggers, and ordinary), our analysis sheds new light on [5] T. Gitlin. Media sociology: The dominant paradigm.some old questions of communications research. First, we Theory and Society, 6(2):205–253, 1978.find that although audience attention has indeed fragmented [6] E. Katz and P. F. Lazarsfeld. Personal influence; theamong a wider pool of content producers than classical mod- part played by people in the flow of massels of mass media, attention remains highly concentrated, communications. Free Press, Glencoe, Ill. 1955. ”where roughly 0.05% of the population accounts for almost [7] G. Kossinets and D. J. Watts. Empirical analysis of anhalf of all attention. Within the population of elite users, evolving social network. Science, 311(5757):88–90,moreover, attention is highly homophilous, with celebrities 2006.following celebrities, media following media, and bloggers [8] H. Kwak, C. Lee, H. Park, and S. Moon. What isfollowing bloggers. Second, we find considerable support for twitter, a social network or a news media? Inthe two-step flow of information—almost half the informa- Proceedings of the 19th international conference ontion that originates from the media passes to the masses indi- World Wide Web, pages 591–600. ACM, 2010.rectly via a diffuse intermediate layer of opinion leaders, who [9] H. D. Lasswell. The structure and function ofalthough classified as ordinary users, are more connected communication in society. In L. Bryson, editor, Theand more exposed to the media than their followers. Third, Communication of Ideas, pages 117–130. University ofwe find that although all categories devote a roughly simi- Illinois Press, Urbana, IL, 1948.lar fraction of their attention to different categories of news [10] P. F. Lazarsfeld, B. Berelson, and H. Gaudet. The(World, U.S., Business, etc), there are some differences— people’s choice; how the voter makes up his mind in aorganizations, for example, devote a surprisingly small frac- presidential campaign. Columbia University Press,tion of their attention to business-related news. We also find New York, 3rd edition, 1968.that different types of content exhibit very different lifes- [11] R. K. Merton. Patterns of influence: Local andpans. In particular, media-originated URLs are dispropor- cosmopolitan influentials. In R. K. Merton, editor,tionately represented among short-lived URLs while those Social theory and social structure, pages 441–474. Freeoriginated by bloggers tend to be overrepresented among Press, New York, 1968.long-lived URLs. Finally, we find that the longest-lived [12] J. Surowiecki. The Wisdom of Crowds : Why the manyURLs are dominated by content such as videos and music, are smarter than the few and how collective wisdomwhich are continually being rediscovered by Twitter users shapes business, economies, societies, and nations.and appear to persist indefinitely. Doubleday, New York, 1st edition, 2004. 2003070095 In closing, we note that although our use of Twitter lists James Surowiecki. Includes bibliographical references.to label users was motivated by a specific set of questions [13] J. B. Walther, C. T. Carr, S. S. W. Choi, D. C.regarding mass vs interpersonal communications, and that DeAndrea, J. Kim, S. T. Tong, and B. Van Der Heide.for this reason we have focused on a limited set of predeter- Interaction of interpersonal, peer, and media influencemined user-categories, it would also be interesting to explore sources online. In Z. Papacharissi, editor, A Networkedautomatic classification schemes from which additional user Self: Identity, Community, and Culture on Socialcategories could emerge. In particular, such an approach Network Sites, pages 17–38. Routledge, 2010.would allow one to examine the category of opinion lead- [14] G. Weimann. The Influentials: People Who Influenceers in more detail, possibly identifying opinion leaders for People. State University of New York Press, Albany,different topics, as has been proposed elsewhere [14]. In NY, 1994.addition, another area for future work would be to extractcontent information in a more systematic manner, shedding [15] J. Weng, E. P. Lim, J. Jiang, and Q. He. Twitterrank:more light on the “what” element of Lasswell’s maxim. And finding topic-sensitive influential twitterers. Infinally, a significant challenge for future work is to merge Proceedings of the third ACM international conferencethe data regarding information flow on Twitter with other on Web search and data mining, pages 261–270. ACM,sources of outcome data—relating, for example, to opinions 2010.or actions that would engage more directly with the “effects”component of Lasswell’s maxim.