Slideshow transcript
Slide 1: Content-Based Social Network Analysis of Online Communities Anatoliy Gruzd Caroline Haythornthwaite Graduate School of Library and Information Science University of Illinois at Urbana-Champaign Social Network/ing Symposium Toronto, 2007
Slide 2: The Problem Online communities are creating a growing volume of texts contributed by a growing number of participants. 100 million posters in Usenet (Marc Smith, quoted in CNET, 2003). 3.12 terabytes of data *daily* on Usenet (2007). By 2010, >70% of digital content will be user-generated, and the majority of it will still be text-based (Technology Consultancy IDC). [Figure: Growth of Usenet (Wikipedia, Oct. 2007)]
Slide 3: Making Sense of Community Action How can we help make sense of community action and interaction based solely on textual interchanges? How can we make the social structures evident for participants, and managers or teachers? How can we make advances from linear streams of text to visualized patterns of interaction? Growth of blog activity March 2003-2006 • 175,000 new blogs a day (2006)
Slide 4: Mapping Online Communities Mappings and internal examinations tend to be based on one aspect of ties: Links between sites; Reports of friendship or work relations; FOAF declarations. With a concentration of quantity over content. [Figure: Flickrverse, Gustavog, 2006 http://www.flickr.com/photo_zoom.gne?id=9708628&context=set-222111&size=l Based on 50 connections between people.]
Slide 5: Mapping Online Communities (2) Emerging mappings include attention to: Poster activity; Actor profiles as posters; Content of sites, e.g., words in common on different sites (Gloor & Zhao, 2006). [Figure: Welser, Gleave, Fisher & Smith, 2007, in JOSS]
Slide 6: Extracting Network Information Determine who is talking to whom Applying social network analysis techniques Determine what they are talking about Applying natural language processing techniques Merge these to produce network detection that better represents ongoing processes
Slide 7: Our Goal Use natural language processing (NLP) to enhance the current techniques of building social networks and to gain more information and insight about Nodes, Relations, and Ties. Current focus is on bulletin boards. Current example is an online learning environment. Procedures are being derived to use for groups with unknown membership.
Slide 8: Adding more with NLP Revealing network information 1. Node discovery 2. Tie discovery 3. Relation discovery 4. Role & Group discovery Network visibility rather than aggregate behavior Important for revealing structures to Participants to understand the ‘lay of the (cyber)land’ and for instructors (or managers) to oversee participation and intervene as necessary
Slide 9: Adding relational information Few (yet) derive relations from content which can reveal Networks based on multiple relations Change in discourse over time Changes in associations among network members by relation and time Few deal with the vagaries of CMC texts Bulletin boards, chat Incorrect spelling, partial sentences, inventive punctuation Deriving who is talking to whom from content analysis Or local language conventions Acronyms, group naming conventions, group word use conventions, nicknames for people and processes
Slide 10: Node and Ties Focus today on nodes and tie discovery Identifying who are the actors in the network Identify nodes, i.e., people Make the tie(s) between nodes Two approaches Chain Network, based on chain of posting Name Network, based on names used in the text
Slide 11: Chain Network: definition options
Example reference chain: A, B, C are earlier posters (A = thread starter, C = last poster) and D is the sender. Tie weights from D to A, B, and C under each option:
Option  A    B    C    Definition
1       0    0    1    Connect a sender to the last person in the post chain only (undirected)
2       1    0    1    Connect a sender to the last and first (= thread starter) person in the chain, and assign equal weight values (e.g. 1) to both ties
3       .5   0    1    Same as option 2, but a tie between a sender and the first person is half weight (e.g. 0.5)
4       .25  .5   1    Connect a sender to all people in the reference chain with decreasing weights
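The four options can be sketched in code. This is only an illustration, not the presenters' implementation: the function name `chain_ties` is hypothetical, the chain is assumed to be listed oldest first with the sender kept separate, and the halving rule for option 4 is an assumption chosen because it reproduces the slide's .25, .5, 1 for a three-person chain.

```python
def chain_ties(chain, sender, option):
    """Return {(sender, other): weight} for a single post.

    chain:  earlier posters in the reference chain, oldest first
            (chain[0] is the thread starter, chain[-1] the last poster).
    option: 1-4, matching the four definitions on the slide.
    """
    if not chain:
        return {}
    first, last = chain[0], chain[-1]
    weights = {}
    if option == 1:
        weights[last] = 1.0
    elif option in (2, 3):
        # Thread starter gets full weight (option 2) or half weight (option 3).
        weights[first] = 1.0 if option == 2 else 0.5
        # Assigned second so a one-person chain keeps weight 1.
        weights[last] = 1.0
    elif option == 4:
        # Assumption: the weight halves with each step back along the chain.
        for steps_back, person in enumerate(reversed(chain)):
            weights[person] = max(weights.get(person, 0.0), 0.5 ** steps_back)
    else:
        raise ValueError("option must be 1, 2, 3, or 4")
    weights.pop(sender, None)  # ignore replies to oneself
    return {(sender, person): w for person, w in weights.items()}


if __name__ == "__main__":
    for opt in (1, 2, 3, 4):
        print(opt, chain_ties(["A", "B", "C"], "D", opt))
```

Summing these per-post dictionaries over all posts on a bulletin board would give the weighted chain network.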
Slide 12: Chain Networks: missed info. Ex.1 Previous post is by Gabriel, Sam replies: ‘Nick, Ann, Gina, Gabriel: I apologize for not backing this up with a good source, but I know from reading about this topic that libraries…’ Ex.2 Previous posts by Gabriel, Sam, Gina, and Eva, then: ‘Gina, I owe you a cookie. This is exactly what I wanted to know. I was already planning on taking 302 next semester, and now I have something to look forward to!’ Ex.3 Post by Fred: ‘I wonder if that could be why other libraries around the world have resisted changing – it's too much work, and as Dan pointed out, too expensive.’
Slide 13: Name networks Making use of node and tie information that is in the text of the postings Issues Disambiguating names/nicknames from text Disambiguate names of people from names of people being discussed (e.g., subject) Detection of aliases for a given person and disambiguation of two or more users with the same name
Slide 14: Hand coding: categories
Network Participants
<from> = person indicated in ‘from’ line of post heading (NB only info that is system generated)
<addressee> = direct reference to other ('I agree with you Todd')
<reference> = indirect reference to other ('Todd has a good point')
<self-reference> = poster references themselves in some way (braindead library student, high school teacher, etc.)
<signature> = name as given by the message author on their post
Named non-participants
<subject>, <subject 2>, or <subject 3> = name is a subject of the discussion, either as one name (Dewey), 2 (Brewster Kahle) or 3 (Charles R. Darwin)
<non-group reference> = reference to a person who is not in the group, nor the subject – e.g., a former professor
Error
<error> = new name appears because of error (e.g., Lackie as a subject instead of Leckie; or part of a prevpost line does not conform to the usual format)
Previous Posts (if not removed from dataset)
<previous-poster> = when the previous message is included, this indicates the poster (‘Janice wrote: ’) (system generated)
<copy> = name appears because it is included with the previous message
Slide 15: Examples of hand-coding
Just a note to clarify something in yesterday's lecture/chat session. I mentioned that Monday's NY Times had an article on <#1><subject> Brewster. I want to clarify that the article concerns the copyright extension law and the current Supreme Court case <#1><subject> Eldred v. <#1><subject> Ashcroft, set to begin today, I believe. <#1><subject 2> Brewster Kahle is currently touring the country in a bookmobile … For more info on this … you can refer to the Web site that <#1><reference> Jodie mentioned yesterday… <#1><signature>LA
NB. Jodie may not even appear in the contributors to this thread.
Several of our programs at UC <#7><subject> Davis have well-intentioned lower division research methods classes that introduce then never reinforce basic skills.
Need to disambiguate “UC Davis” from someone called “Davis”.
Research (to paraphrase my hero, <#8><subject> Shrek) is like onions. Not because it stinks, but because it is made up of layers.
“Shrek” as a name will not appear in conventional name lists.
Slide 16: Automated Node & Tie Discovery Method 1. Determine names in the dataset, and assign a probability value 2. Determine email address to name relationship 3. Assign tie weight to each discovered tie
Slide 17: Automated Node Discovery Named Entity Recognition. Discovery of personal names: The 1990 US Census http://www.census.gov/genealogy/names; Capitalization. Distinguishing between names of people in and outside the class. Having a list of names doesn’t always work, e.g., if someone uses their middle name which is not on the name list, or they use a short name or nickname. Method: associate names with email addresses in the class, relying on content-based (e.g. context words) and structure-based (e.g. word position) features of names. Issues: Many names - same person; Same name - many people
Slide 18: Automated Node Discovery (2) EXAMPLE
From: wilma@bedrock.us (=Wilma)
Reference Chain: tank123@gl.edu (=Dustin) => hle@gl.edu (=Sam)
Hi Dustin, Sam, Nick and all, I appreciate your posts from this and last week […]. I keep thinking of poor Charlie who only wanted information on “dogs“ Sam has been talking about. […] Wilma.
Words to the Left   Name      Words to the Right   Position %   Score “TO”   Score “FROM”
* Hi                Dustin    Sam, Nick,           0            0.322        -0.004
* Dustin,           Sam       Nick and             1            0.321        -0.002
Dustin, Sam,        Nick      and all,             2            0.320        -0.001
of poor             Charlie   who only             50           0.05         0.04
on “dogs“           Sam       has been             65           0.285        0.07
*                   Wilma     *                    88           0.0012       0.116
* - end of the line
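The features in the table (context words on each side of a name and the name's relative position) could be collected along the following lines. This is a rough sketch under stated assumptions, not the presenters' method: the function `name_candidates` and the `known_names` set are hypothetical, position is taken as token index divided by token count, and the “TO”/“FROM” scores are left out because the slides do not say how they are computed.

```python
# Punctuation stripped from a token before it is compared with the name list.
PUNCT = ".,;:!?\"'()“”"


def name_candidates(text, known_names, window=2):
    """Return (left_words, name, right_words, position_pct) for each candidate.

    known_names: lowercase first names, e.g. drawn from a census name list.
    window:      number of context tokens kept on each side of the name.
    """
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        word = token.strip(PUNCT)
        # Capitalization plus name-list membership, as on the previous slide.
        if not word or not word[0].isupper():
            continue
        if word.lower() not in known_names:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        position = round(100 * i / len(tokens))  # assumed definition
        rows.append((left, word, right, position))
    return rows


if __name__ == "__main__":
    post = "Hi Dustin, Sam, Nick and all, I keep thinking of poor Charlie. Wilma."
    names = {"dustin", "sam", "nick", "charlie", "wilma"}
    for row in name_candidates(post, names):
        print(row)
```

A classifier or scoring rule would then use these features to decide whether a name is an addressee (“TO”) or the author's own signature (“FROM”).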
Slide 19: Automated Tie Discovery Associate each sender in the class with all names mentioned in his/her emails. For example: Wilma ---> Dustin = tank123@gl.edu. Wilma ---> Charlie: no email for Charlie, so not a person in the conversation group (e.g., when Steve and I took Professor Sid’s course last year). Wilma ---> no mention of a name: info on tie is only in the Chain network; could be the start of a thread or a change of topic within a thread, or a general posting. Assign tie weight: Pair counts; Mutual information
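Both weighting schemes can be illustrated briefly. The slide names "pair counts" and "mutual information" without giving formulas, so this sketch assumes pointwise mutual information over (sender, mentioned person) pairs; the function names are hypothetical.

```python
import math
from collections import Counter


def pair_counts(mentions):
    """Count how often each sender mentions each person.

    mentions: iterable of (sender, mentioned_person) pairs, one per mention.
    """
    return Counter(mentions)


def pmi_weights(mentions):
    """Pointwise mutual information, log2(p(s, t) / (p(s) * p(t))), per tie.

    Assumption: probabilities are estimated from mention counts alone.
    """
    counts = Counter(mentions)
    total = sum(counts.values())
    sender_totals = Counter()
    target_totals = Counter()
    for (sender, target), n in counts.items():
        sender_totals[sender] += n
        target_totals[target] += n
    return {
        (s, t): math.log2(n * total / (sender_totals[s] * target_totals[t]))
        for (s, t), n in counts.items()
    }


if __name__ == "__main__":
    mentions = [("Wilma", "Dustin"), ("Wilma", "Dustin"),
                ("Wilma", "Sam"), ("Sam", "Dustin")]
    print(pair_counts(mentions))
    print(pmi_weights(mentions))
```

Pair counts favor prolific posters, while a mutual-information style weight discounts names that everyone mentions; which behavior is preferable depends on the question being asked of the network.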
Slide 20: Chain vs. Name Networks Get added information from the name network Ex. BBoards #06,07,08 Nodes: 37 Messages: 346 Chain network ties: 223 Name network ties: 215 / 429 Shared ties: 140 QAP Pearson Correlation: 0.453 (p = .000)
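A comparison of this kind can be sketched as follows. The slide reports only the figures, so this is a generic illustration of counting shared ties and running a QAP (quadratic assignment procedure) correlation, not the tool the presenters used; the function names, the one-sided p-value, and the default of 1000 permutations are assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (must not be constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def off_diagonal(matrix, order):
    """Off-diagonal cells after relabeling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(n) if i != j]


def shared_ties(a, b):
    """Number of directed ties present in both adjacency matrices."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(n)
               if i != j and a[i][j] and b[i][j])


def qap_correlation(a, b, permutations=1000, seed=0):
    """Observed correlation plus the share of node relabelings of `a`
    whose correlation with `b` is at least as large."""
    identity = list(range(len(a)))
    b_cells = off_diagonal(b, identity)
    observed = pearson(off_diagonal(a, identity), b_cells)
    rng = random.Random(seed)
    at_least = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns together
        if pearson(off_diagonal(a, order), b_cells) >= observed:
            at_least += 1
    return observed, at_least / permutations


if __name__ == "__main__":
    chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
    name = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 1, 0]]
    print(shared_ties(chain, name), qap_correlation(chain, name))
```

Permuting node labels rather than individual cells keeps each network's structure intact, which is why QAP is preferred over an ordinary significance test for comparing two networks on the same nodes.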
Slide 21: An ego network for Brent Name Network Chain Network Visualization powered by http://www.netvis.org
Slide 22: An ego network for Tyler Name Network Chain Network kurt -> Kurt Cobain, a lead singer for the rock band Nirvana dewey -> John Dewey, philosopher & educator santa_monica -> Santa Monica Public Library mark –> mark up language Visualization powered by http://www.netvis.org
Slide 23: Conclusion Uses and benefits of content-based networks Discovery of social network behavior rather than posting behavior Discovery of social interactions between group members that happened outside the group (e.g. fishing trip) Discovery of relations between group members and people outside the group (e.g. a shared friend from another department) Expert/Co-discussant discovery Study of perceived social networks without directly collecting survey-data from participants (?)
Slide 24: References and Further Reading Related papers
Haythornthwaite, C. & Gruzd, A. (2007). A noun phrase analysis tool for mining online community. In C. Steinfield, B.T. Pentland, M. Ackerman & N. Contractor (eds.). Communities and Technologies 2007 (pp. 67-86). London: Springer.
Welser, H. T., Gleave, E., Fisher, D. & Smith, M. (2007). Visualizing the signatures of social roles in online discussion groups. Journal of Social Structure, 8(2). http://www.cmu.edu/joss/content/articles/volume8/Welser/



