Hi, I’m Brian Larson, a PhD student in Rhetoric and Scientific and Technical Communication at the U of M. I came to talk about “Examining a Twitter-Based Discourse Community of Composition Scholars.” But that presentation is based on a paper I wrote back in the spring of 2012, which was the foundation for the proposal that Jan, Kristin, and I made to present here at the Cs. The paper is posted on the Cs resources page for this section; it’s also available on my blog at www.Rhetoricked.com. (Sorry for the small number of handouts, but the blog has all the same information.) I hope you will also connect with me on Twitter: I’m @Rhetoricked. Anyway, all this was before my prelims hit. I finished writing my fourth and final 24-hour prelim while I was here yesterday morning. Last night, basking in the post-prelim glow, I reconsidered the paper and presentation, or should I say, I considered just ADVANCE trashing it. Then I thought, “Well, what I really want to do is problematize my original goal.” So, here is the revised title. ADVANCE My presentation will veer a little bit away from the original and the paper, but let’s see what happens.
You’ve probably seen visualizations of social network graphs like this one at presentations (or if you have not yet, you will). They usually represent very superficial analyses of what I call “low hanging data.” Anyone can set up a Twitter archive for an event and analyze connections between the folks who used it. Once you have that data, you can run it through open access applications like NodeXL and generate visualizations like this one, which are in some senses informative. But I have some doubts about the validity of some of this research, on the grounds that it does not connect the kinds of things these social graphs identify as meaningful with theoretical constructs within writing studies. I wanted to explore whether such a Twitter archive could be analyzed using social graph theory in a way that would connect to theory.
Genre theory is usually interested in contextualized writing practices. So, for example, studies in genre theory usually circumscribe some community or group and examine the writing and other practices of its members: of students within a single class; of employees working in a workplace; of researchers in a particular discipline; or even of a closed but more diffuse network like a Usenet group. You can’t just go out and dip into Twitter in hopes that the sample you get will allow you to identify relations. After 10 seconds of sampling, you’d have 55,000 or more tweets, and the odds are that none of the folks tweeting in that sample are connected to each other except by the fact that they are Twitter subscribers. This is an example of what computational linguists call a “sparsity problem”: the phenomenon of interest is too sparsely distributed in the population to sample it randomly. The solution? Some kind of “root” for sampling.
So, how do we figure out whether some subset is a community? That depends, of course, on what you consider a community. All of the usual genre-theoretical definitions are useless here… so how about sociology of social networks…?
The Gruzd et al. piece is typical of the usual approach. Common language: One possibility is to assign linguistic features to tweets and then group them by tweets (or subscribers) that share features. However, this assumes that the common language characteristic is actually selective of groups, and that the linguistic features of the tweets allow discrimination among groups. (Sparsity problem.) High centers: entities that “society is naturally organized around and under”. In social network theory, this might include something like the government, which has a relation with each of us, but which may or may not do much to mediate our formation into communities. The last of these is tough, because many hashtags are for specific times and places. Others, like #FYChat for folks interested in First Year Comp, might prove more cohesive out of the box.
Gruzd also identified key phenomenological variables for assessing the existence of a community: So here, we have quite a few variables to use to try to assess whether a subset of Twitter subscribers counts as a community. How can we relate that to data from Twitter itself, and how do we make sense of pretty pictures like the one I showed in the first slide? That takes us to network analysis theory….
Here’s a visualization of a graph with six nodes (the circles). Let’s assume it’s a graph of friendship relations. Unlike Plato’s Socrates in the Meno, I interpret friendship as fully reciprocal. Thus, there are no arcs or arrows here. There are seven friendship relations or edges, represented by the lines between circles. ADVANCE In this graph, node 6 is friends with node 4, ADVANCE which is also friends with nodes 3 and 5. It’s important to remember that the visualization is NOT the graph. It’s just a representation of the graph. It’s fully isomorphic with a listing of the relations. ADVANCE
Here is an unordered list of unordered pairs, which completely represents the graph visualized here. One relation between nodes is called the geodesic distance, or the shortest route through the graph between them. In this graph, for example, the geodesic distance between nodes 1 and 6 is three ADVANCE because it takes a minimum of three steps to get from 1 to 6.
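As a minimal sketch of the idea that the list of pairs is the graph, the Python below stores the six-node example as an edge list and finds a geodesic distance with a breadth-first search. The edge list is an assumption: it is reconstructed from the figures quoted in these notes (seven edges, node 6 tied only to node 4, node 4 tied to nodes 3 and 5), not copied from the slide. The function names are illustrative only.

```python
from collections import deque

# Assumed edge list for the six-node example: an unordered list of
# unordered pairs, reconstructed from the description in the notes.
EDGES = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbors mapping."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def geodesic_distance(adjacency, start, goal):
    """Breadth-first search: edges on a shortest route, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


ADJ = build_adjacency(EDGES)
print(geodesic_distance(ADJ, 1, 6))  # 3, for example via 1 -> 5 -> 4 -> 6
```

No drawing is needed at any point: every measure discussed in the rest of the talk can be computed from the pairs alone.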
One way you can look at a group of nodes is to calculate its density. The highest density you can have is a “complete graph”: that’s one where every node connects to every other node. ADVANCE The highest number of edges you can have in a complete graph with N nodes is given by this formula. ADVANCE In our 6-node graph, that’s 15. ADVANCE The density of the graph is the actual number of edges divided by the total possible number. We can look at density in any graph or subgraph, but we also want to be able to identify individuals who play important roles in a group. We can do that with measures of centrality. ADVANCE
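The density calculation can be sketched in a few lines of Python. The maximum number of edges among N nodes in an undirected graph is N(N-1)/2, which gives the 15 mentioned above for six nodes. The edge list is the same assumed reconstruction of the example graph, and the function names are illustrative.

```python
# Assumed edge list for the six-node example graph (seven edges).
EDGES = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]


def max_edges(node_count):
    """Edges in a complete undirected graph: every node tied to every other."""
    return node_count * (node_count - 1) // 2


def density(edges, node_count):
    """Actual number of edges divided by the maximum possible number."""
    return len(edges) / max_edges(node_count)


nodes = {n for edge in EDGES for n in edge}
print(max_edges(len(nodes)))                  # 15
print(round(density(EDGES, len(nodes)), 2))   # 0.47, i.e. 7 of 15 possible edges
```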
There are three measures of centrality. Degree centrality is the simplest: it’s the number of edges emanating from a node. A higher number means the node is more central. In this example.
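Degree centrality can be sketched by counting how many times each node appears in the edge list. The edge list below is the assumed reconstruction of the six-node example; nothing else is needed.

```python
from collections import Counter

# Assumed edge list for the six-node example graph.
EDGES = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]

# Each edge adds one to the degree of both of its endpoints.
degree = Counter()
for a, b in EDGES:
    degree[a] += 1
    degree[b] += 1

print(sorted(degree.items()))
# [(1, 2), (2, 3), (3, 2), (4, 3), (5, 3), (6, 1)]
```

On this measure nodes 2, 4, and 5 tie for most central with three edges each, and node 6 is the least central with one.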
With closeness centrality, a lower number means the node is more central. That’s because it’s calculated as the average of the distance from the node of interest to each other node.
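A short sketch of closeness as described here, that is, the average geodesic distance from one node to every other node, again using the assumed edge list for the six-node example. One caution: some libraries, NetworkX among them, report the reciprocal of this average, so that a higher value means more central.

```python
from collections import deque

# Assumed edge list for the six-node example graph.
EDGES = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]

ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)


def distances_from(adjacency, source):
    """Breadth-first search returning geodesic distance to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(adjacency, node):
    """Average distance from node to every other node (lower = more central)."""
    others = [d for n, d in distances_from(adjacency, node).items() if n != node]
    return sum(others) / len(others)


for node in (1, 3, 5):
    print(node, closeness(ADJ, node))
# 1 1.8
# 3 1.6
# 5 1.4
```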
Betweenness centrality is a measure of how often a vertex lies on the shortest route between two other vertices. So, in Figure 2, Person 1 has a betweenness centrality of 0, as no other person needs to connect through Person 1; Person 5 has a betweenness centrality of 2, because she lies on the shortest route between Person 1 and Persons 4 and 6; Person 4 has the highest betweenness centrality at 4, as she lies on the only route between Person 6 and the other four people. Another potentially important measure of a node’s potential position within a network or community is its clustering coefficient. This is a measure of the density of a sub-network consisting of the person’s connections (Hansen 2010, 3.5.2). Thus, if a person’s friends are all friends of each other, she has a high clustering coefficient. In the example in Figure 2, Person 1 has a clustering coefficient of 1, because her two friends, Persons 2 and 5, are friends of each other. Person 3 has two friends who are not friends of each other, and thus has a clustering coefficient of 0. Person 5 falls in the middle: of the three possible relations between Persons 1, 2, and 4, only one exists, making Person 5’s clustering coefficient 0.33.
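Both measures can be checked with a brute-force sketch on the assumed edge list for the six-node example. The betweenness values quoted above (0, 2, and 4) correspond to counting the pairs for which a person lies on every shortest route; the more common definition gives fractional credit when shortest routes tie, and yields 3.0 for Person 5 and 4.5 for Person 4. The `strict` flag below is a hypothetical switch between the two counts. Enumerating paths this way is only practical for very small graphs.

```python
from collections import deque
from itertools import combinations

# Assumed edge list for the six-node example graph.
EDGES = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]

ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)


def shortest_paths(adj, source, target):
    """All shortest routes from source to target, each as a list of nodes."""
    best, found = None, []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # every remaining path is longer than the shortest one
        if path[-1] == target:
            best = len(path)
            found.append(path)
            continue
        for nxt in adj[path[-1]]:
            if nxt not in path:
                queue.append(path + [nxt])
    return found


def betweenness(adj, node, strict=False):
    """Sum over pairs of other nodes of the share of shortest routes via node.

    strict=True counts a pair only if every shortest route passes through node.
    """
    total = 0.0
    others = [n for n in adj if n != node]
    for s, t in combinations(others, 2):
        paths = shortest_paths(adj, s, t)
        through = sum(1 for p in paths if node in p)
        if strict:
            total += 1 if through == len(paths) else 0
        else:
            total += through / len(paths)
    return total


def clustering(adj, node):
    """Density of ties among a node's neighbors (0.0 with fewer than two)."""
    friends = adj[node]
    if len(friends) < 2:
        return 0.0
    possible = len(friends) * (len(friends) - 1) / 2
    actual = sum(1 for a, b in combinations(friends, 2) if b in adj[a])
    return actual / possible


print(betweenness(ADJ, 5, strict=True), betweenness(ADJ, 5))  # 2.0 3.0
print(betweenness(ADJ, 4, strict=True), betweenness(ADJ, 4))  # 4.0 4.5
print(clustering(ADJ, 1), clustering(ADJ, 3), round(clustering(ADJ, 5), 2))
# 1.0 0.0 0.33
```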
Over- and under-inclusive: There may be members of a community not attending an event with a hashtag. There may be folks attending the event, using the hashtag, who are not really members of any community (in our theoretical sense). On an exploratory basis, I measured Interactivity among members by using the clustering coefficient of the candidate community compared to a randomly generated clustering coefficient and compared to edges between members and non-members. So, let’s look at a data set…
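One way such a comparison might be set up is sketched below. This is an assumption, not the procedure used in the paper: the notes do not say how the random clustering coefficient was generated, so the sketch simply compares a candidate group's average clustering coefficient with the mean over random graphs that have the same number of nodes and edges. The six-node example stands in for a candidate community, and `random_baseline`, `trials`, and `seed` are hypothetical names.

```python
import random
from itertools import combinations

# Stand-in data: the assumed six-node example graph, not Twitter data.
EDGES = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]
NODES = sorted({n for edge in EDGES for n in edge})


def adjacency(nodes, edges):
    """Node -> set-of-neighbors mapping that keeps isolated nodes too."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with < 2 neighbors score 0."""
    scores = []
    for node, friends in adj.items():
        k = len(friends)
        if k < 2:
            scores.append(0.0)
            continue
        linked = sum(1 for a, b in combinations(friends, 2) if b in adj[a])
        scores.append(linked / (k * (k - 1) / 2))
    return sum(scores) / len(scores)


def random_baseline(nodes, edge_count, trials=200, seed=0):
    """Mean average clustering over random graphs of the same size (assumed model)."""
    rng = random.Random(seed)
    all_pairs = list(combinations(nodes, 2))
    total = 0.0
    for _ in range(trials):
        sampled = rng.sample(all_pairs, edge_count)
        total += average_clustering(adjacency(nodes, sampled))
    return total / trials


observed = average_clustering(adjacency(NODES, EDGES))
baseline = random_baseline(NODES, len(EDGES))
print(round(observed, 3), round(baseline, 3))
```

An observed value well above the baseline would suggest more interaction among members than chance alone would produce, though on a graph this small the comparison carries little weight.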
This image shows the results of NodeXL’s efforts to automatically classify the Twitter accounts into groups or candidate communities based on interactivity. The rectangles drawn around subsets of nodes represent NodeXL’s best guess as to the boundaries of each candidate community. In this case, the size of each node represents its degree, the number of times this account sent or received @-replies or retweets. It’s clear that two members of the CCCCs hashtag ‘community,’ jenlmichaels and mrsalander, have high degree ratings and also have connections broadly across all candidate communities, not just within their ‘home’ groupings. Within the candidate groups, members appear to have relatively few connections to each other, so it’s possible that the relevant clustering coefficients, though significant enough to cause NodeXL to group the nodes together, might not represent a real sense of membership in any of these groups. But does this graph visualization MEAN anything? My answer is “Not yet.”
4C13 J.15 Larson "Twitter based discourse community"
Finding problems with “Examining a Twitter-Based Discourse Community of Composition Scholars”
Brian N. Larson
@Rhetoricked / www.Rhetoricked.com
Department of Writing Studies, University of Minnesota
Motivation
• Ph.D. Seminar spring 2012
 – “Emerging Genres on the Internet”
 – Dr. Carol Berkenkotter
• Do certain Twitter practices constitute genres?
 – But genres belong to communities
 – …or to activity systems
 – Both theoretically bounded
Problem
• What theoretically licensed means can I use to sample a subset of the Twitter population?
• Can I implement that practically?
• I smugly proposed what’s in the paper
• I now repent, but think the questions are still worth exploring
Outline: Define your terms, get some data, and see where you’re at!
• Dealing with the Twitter fire hose
• “Discourse community lite”
• Social network theory basics
• Example of data from 2012 CCCCs
• Urging caution, suggesting next steps
Challenge: Sampling Twitter
• Genre theory considers communities
 – Swales 1990; Berkenkotter & Huckin 1994; Russell 1994 (activity systems); Devitt 2004 (communities, collectives, social networks)
• How to sample? Sparsity problems
 – Twitter has 200 million active users
 – More than ½ billion tweets per day (5500/second)
 – Need a “root”
Hashtag as root?
• Potential virtues
 – Easy to search for
 – Definitive threshold characteristic
• Potential vices
 – Unknown relation to theoretical concerns
 – Over- and under-inclusive
• How about hashtag + follows of users of hashtag?
 – Assumption: Many communities within hashtag sample
Community? Twitter users span geographies and topics
• (Gruzd et al. 2010)
• A common language
• “Temporality” or community shares “a consciousness of a shared temporal dimension in which they co-exist”
• The decline in prominence of “high centers”
• Interactivity among members
• A variety of communicators
• “[C]ommon public place where members can meet and interact”
• “[S]ustained membership over time”
Community: Sense of community
• (Gruzd et al. 2010)
• Members feel that they are members
• Members have influence within the community
• Community meets some member needs
• Members share an emotional connection
Network analysis theory
• A graph
 – “Nodes” or “vertices” represent individuals
 – “Edges” or “arcs” represent relationships between the individuals.
• In visual representations
 – A node is represented as a point on the graph
 – An edge is a line between nodes
 – An arc is a line with an arrow, i.e., a unidirectional relationship
Figure 2: Friendship diagram Example of network graph. Source: Wikimedia commons.
Figure 2: Closeness centrality. Node ①: closeness = 1.8. Node ③: closeness = 1.6. Node ⑤: closeness = 1.4. Closeness is the average distance from a node to each other node. Lower = more central.
Figure 2: Betweenness centrality and clustering coefficient
Centrality summarized
• Degree: How many connections does node have?
• Closeness: How close is node to other nodes in graph?
• Betweenness: To what extent does node lie along “critical paths”?
What are the edges in Twitter?
• Not necessarily reciprocal: They are directional or “arcs”
• I follow you, you may or may not follow me
• I @-reply to you…
• I retweet you…
Community candidate variables
• Start with a Twitter hashtag
 – Possibly over- and under-inclusive…
• Bound sample in time
• Find who each subscriber follows
• Interactivity among members: Density of edges representing @-replies and retweets among candidate group members
• Individual influence: Density of edges representing @-replies and retweets among candidate group members
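The interactivity variable can be sketched as the density of directed arcs among a candidate group. Everything in the example is an assumption for illustration: the tweets and account names are invented, any @-mention (including the one in an “RT @user” retweet) is treated as an arc from the author to the mentioned account, and the maximum number of arcs among N members is taken as N(N-1) because arcs are directional.

```python
import re

# Hypothetical tweets as (author, text) pairs; not taken from the CCCC archive.
TWEETS = [
    ("alice", "@bob great panel this morning #4c12"),
    ("bob", "RT @alice: slides are posted #4c12"),
    ("carol", "@alice thanks for the notes"),
    ("dave", "Heading to the keynote #4c12"),
]

MENTION = re.compile(r"@(\w+)")


def arcs_from_tweets(tweets):
    """Directed (author, mentioned account) pairs for @-replies and retweets."""
    arcs = set()
    for author, text in tweets:
        for target in MENTION.findall(text):
            target = target.lower()
            if target != author:  # ignore self-mentions
                arcs.add((author, target))
    return arcs


def directed_density(arcs, members):
    """Arcs among members divided by the N * (N - 1) possible arcs."""
    members = set(members)
    inside = {(a, b) for a, b in arcs if a in members and b in members}
    n = len(members)
    return len(inside) / (n * (n - 1)) if n > 1 else 0.0


arcs = arcs_from_tweets(TWEETS)
print(sorted(arcs))
# [('alice', 'bob'), ('bob', 'alice'), ('carol', 'alice')]
print(directed_density(arcs, ["alice", "bob", "carol", "dave"]))  # 0.25
```

Counting repeated interactions, or keeping @-replies and retweets as separate arc types, would be natural refinements; this sketch only records whether an arc exists.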
CCCC 2012: Twitter data set
• Jen Michaels set up the archive
• March 9-23, 2012
• CCCCs hashtags
• 5,000+ tweets, nearly 600 subscribers
• Power distribution of tweets
 – About 115 subscribers tweeted >10 times
 – Fewer than 60 tweeted >25 times
 – Six subscribers tweeted 100+ times each
CCCC 2012: Generating a graph
• NodeXL (Hansen et al. 2010)
 – Open source options exist, e.g., NetworkX (http://networkx.github.com/; written in Python)
• Generate graph
• (Example used limited data)
Qualitative research is needed
• Systematic exploration of Tweets is fine, BUT
• We need qualitative research regarding Twitter account holders and their accounts
 – To what extent do members of candidate communities feel or believe that the candidate theoretical communities are real communities?
 – To what extent do those who retweet and @-reply to each other feel that those actions are constitutive of a community among them?
 – To what extent should we include ‘institutional accounts’?
Next steps
• Different data (problems with 2012 CCCCs data; problems with my involvement)
 – Go for several smaller-volume hashtags to find baselines
• Collaborators?
 – Computational linguists & computer scientists
 – Other writing studies researchers
Brian N. Larson
@Rhetoricked / www.Rhetoricked.com
Department of Writing Studies, University of Minnesota