Content-Based Social Network Analysis of Online Communities


Anatoliy Gruzd and Caroline Haythornthwaite (U of Illinois)
Social Network/ing Symposium,
Toronto, 2007

  1. Content-Based Social Network Analysis of Online Communities. Anatoliy Gruzd and Caroline Haythornthwaite, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Social Network/ing Symposium, Toronto, 2007
  2. The Problem <ul><li>Online communities are creating a growing volume of text contributed by a growing number of participants </li></ul><ul><ul><li>100 million posters in Usenet (Marc Smith, quoted in CNET, 2003) </li></ul></ul><ul><ul><li>3.12 terabytes of data daily on Usenet (2007) </li></ul></ul><ul><ul><li>By 2010, more than 70% of digital content will be user-generated, and the majority of it will still be text-based (technology consultancy IDC) </li></ul></ul>Figure: Growth of Usenet (Wikipedia, Oct. 2007)
  3. Making Sense of Community Action <ul><li>How can we help make sense of community action and interaction based solely on textual interchanges? </li></ul><ul><ul><li>How can we make the social structures evident to participants, managers, or teachers? </li></ul></ul><ul><ul><li>How can we advance from linear streams of text to visualized patterns of interaction? </li></ul></ul><ul><li>Growth of blog activity, March 2003-2006 </li></ul><ul><li>175,000 new blogs a day (2006) </li></ul>
  4. Mapping Online Communities <ul><li>Mappings and internal examinations tend to be based on one aspect of ties </li></ul><ul><ul><li>Links between sites </li></ul></ul><ul><ul><li>Reports of friendship or work relations </li></ul></ul><ul><ul><li>FOAF declarations </li></ul></ul><ul><li>With a concentration on quantity over content </li></ul>Figure: Flickrverse (Gustavog, 2006), based on 50 connections between people.
  5. Mapping Online Communities (2) <ul><li>Emerging mappings include attention to </li></ul><ul><ul><li>Poster activity </li></ul></ul><ul><ul><li>Actor profiles as posters </li></ul></ul><ul><ul><li>Content of sites </li></ul></ul><ul><ul><ul><li>e.g., words in common on different sites (Gloor & Zhao, 2006) </li></ul></ul></ul>Figure: Welser, Gleave, Fisher & Smith (2007), in the Journal of Social Structure
  6. Extracting Network Information <ul><li>Determine who is talking to whom </li></ul><ul><ul><li>Applying social network analysis techniques </li></ul></ul><ul><li>Determine what they are talking about </li></ul><ul><ul><li>Applying natural language processing techniques </li></ul></ul><ul><li>Merge these to produce network detection that better represents ongoing processes </li></ul>
  7. Our Goal <ul><li>Use natural language processing (NLP) to </li></ul><ul><ul><li>enhance the current techniques of building social networks </li></ul></ul><ul><ul><li>gain more information and insight about Nodes, Relations, and Ties </li></ul></ul><ul><li>Current focus is on bulletin boards </li></ul><ul><ul><li>Current example is an online learning environment </li></ul></ul><ul><ul><li>Procedures are being derived for use with groups of unknown membership </li></ul></ul>
  8. Adding more with NLP <ul><li>Revealing network information </li></ul><ul><ul><li>1. Node discovery </li></ul></ul><ul><ul><li>2. Tie discovery </li></ul></ul><ul><ul><li>3. Relation discovery </li></ul></ul><ul><ul><li>4. Role & Group discovery </li></ul></ul><ul><li>Network visibility rather than aggregate behavior </li></ul><ul><li>Important for revealing structures to </li></ul><ul><ul><li>Participants, so they can understand the ‘lay of the (cyber)land’, and instructors (or managers), so they can oversee participation and intervene as necessary </li></ul></ul>
  9. Adding relational information <ul><li>Few approaches (yet) derive relations from content, which can reveal </li></ul><ul><ul><li>Networks based on multiple relations </li></ul></ul><ul><ul><li>Change in discourse over time </li></ul></ul><ul><ul><li>Changes in associations among network members by relation and time </li></ul></ul><ul><li>Few deal with the vagaries of computer-mediated communication (CMC) texts </li></ul><ul><ul><li>Bulletin boards, chat </li></ul></ul><ul><ul><li>Incorrect spelling, partial sentences, inventive punctuation </li></ul></ul><ul><ul><li>Deriving who is talking to whom from content analysis </li></ul></ul><ul><li>Or with local language conventions </li></ul><ul><ul><li>Acronyms, group naming conventions, group word use conventions, nicknames for people and processes </li></ul></ul>
  10. Nodes and Ties <ul><li>Focus today is on node and tie discovery </li></ul><ul><li>Identifying who the actors in the network are </li></ul><ul><ul><li>Identify nodes, i.e., people </li></ul></ul><ul><ul><li>Make the tie(s) between nodes </li></ul></ul><ul><li>Two approaches </li></ul><ul><ul><li>Chain Network, based on the chain of postings </li></ul></ul><ul><ul><li>Name Network, based on names used in the text </li></ul></ul>
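  11. Chain Network: definition options <ul><li>Option 1: Connect a sender to the last person in the post chain only (undirected). </li></ul><ul><li>Option 2: Connect a sender to the last and first (= thread starter) person in the chain, and assign equal weight values (e.g. 1) to both ties. </li></ul><ul><li>Option 3: Same as option 2, but the tie between a sender and the first person has half weight (e.g. 0.5). </li></ul><ul><li>Option 4: Connect a sender to all people in the reference chain, with decreasing weights. </li></ul>Example weights for a chain of four posters (A, B, C, D):
     <table>
     <tr><th>Tie from sender to</th><th>Option 1</th><th>Option 2</th><th>Option 3</th><th>Option 4</th></tr>
     <tr><td>Last person in the chain</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
     <tr><td>Intermediate person</td><td>0</td><td>0</td><td>0</td><td>.5</td></tr>
     <tr><td>First person (thread starter)</td><td>0</td><td>1</td><td>.5</td><td>.25</td></tr>
     </table>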
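
The four chain-network definitions (last poster only; last and first with equal weights; first at half weight; everyone with decreasing weights) can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `chain_ties`, the option numbers 1 to 4, and the halving rule for "decreasing weights" are assumptions chosen to match the example values 1, .5, .25 on the slide.

```python
def chain_ties(sender, chain, option):
    """Return {(sender, other): weight} for a single post.

    `chain` lists the earlier posters in the reference chain, from the
    thread starter (first) to the most recent poster (last).
    The option numbering (1-4) is an assumption for this sketch.
    """
    # Ignore the sender's own earlier posts so no self-ties are created.
    others = [p for p in chain if p != sender]
    if not others:
        return {}
    first, last = others[0], others[-1]

    if option == 1:
        # Tie to the last person in the chain only.
        return {(sender, last): 1.0}
    if option in (2, 3):
        # Tie to the last person, plus the thread starter at
        # full weight (option 2) or half weight (option 3).
        ties = {(sender, last): 1.0}
        if first != last:
            ties[(sender, first)] = 1.0 if option == 2 else 0.5
        return ties
    if option == 4:
        # Tie to everyone in the chain; the weight halves with each
        # step back (assumed rule: 1, 0.5, 0.25, ...).
        ties = {}
        weight = 1.0
        for person in reversed(others):
            # Keep the largest weight if a person posted more than once.
            ties.setdefault((sender, person), weight)
            weight /= 2
        return ties
    raise ValueError("option must be 1, 2, 3 or 4")
```

For example, `chain_ties("D", ["A", "B", "C"], 4)` ties D to C, B and A with weights 1, 0.5 and 0.25, while option 1 keeps only the D-C tie.
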
  12. Chain Networks: missed information. Three examples: <ul><li>Previous post is by Gabriel; Sam replies: ‘Nick, Ann, Gina, Gabriel: I apologize for not backing this up with a good source, but I know from reading about this topic that libraries…’ </li></ul><ul><li>Previous posts are by Gabriel, Sam, Gina, and Eva, then: ‘Gina, I owe you a cookie. This is exactly what I wanted to know. I was already planning on taking 302 next semester, and now I have something to look forward to!’ </li></ul><ul><li>Post by Fred: ‘I wonder if that could be why other libraries around the world have resisted changing – it's too much work, and as Dan pointed out, too expensive.’ </li></ul>
  13. Name networks <ul><li>Making use of node and tie information that is in the text of the postings </li></ul><ul><li>Issues </li></ul><ul><ul><li>Distinguishing names/nicknames from other text </li></ul></ul><ul><ul><li>Distinguishing names of participants from names of people being discussed (e.g., the subject of a discussion) </li></ul></ul><ul><ul><li>Detection of aliases for a given person and disambiguation of two or more users with the same name </li></ul></ul>
  14. Hand coding: categories <ul><li>Network Participants </li></ul><ul><ul><li><from> = person indicated in ‘from’ line of post heading (NB: the only information that is system generated) </li></ul></ul><ul><ul><li><addressee> = direct reference to other ('I agree with you Todd') </li></ul></ul><ul><ul><li><reference> = indirect reference to other ('Todd has a good point') </li></ul></ul><ul><ul><li><self-reference> = poster references themselves in some way (braindead library student, high school teacher, etc.) </li></ul></ul><ul><ul><li><signature> = name as given by the message author on their post </li></ul></ul><ul><li>Named non-participants </li></ul><ul><ul><li><subject>, <subject 2>, or <subject 3> = name is a subject of the discussion, either as one name (Dewey), 2 (Brewster Kahle) or 3 (Charles R. Darwin) </li></ul></ul><ul><ul><li><non-group reference> = reference to a person who is not in the group, nor the subject, e.g., a former professor </li></ul></ul><ul><li>Error </li></ul><ul><ul><li><error> = new name appears because of error (e.g., Lackie as a subject instead of Leckie; or part of a prevpost line does not conform to the usual format) </li></ul></ul><ul><li>Previous Posts (if not removed from dataset) </li></ul><ul><ul><li><previous-poster> = when the previous message is included, this indicates the poster (‘Janice wrote:’) (system generated) </li></ul></ul><ul><ul><li><copy> = name appears because it is included with the previous message </li></ul></ul>
  15. Examples of hand-coding <ul><li>Just a note to clarify something in yesterday's lecture/chat session. I mentioned that Monday's NY Times had an article on <#1><subject> Brewster. I want to clarify that the article concerns the copyright extension law and the current Supreme Court case <#1><subject> Eldred v. <#1><subject> Ashcroft, set to begin today, I believe. <#1><subject 2> Brewster Kahle is currently touring the country in a bookmobile … For more info on this … you can refer to the Web site that <#1><reference> Jodie mentioned yesterday… <#1><signature> LA </li></ul><ul><ul><li>NB: Jodie may not even appear among the contributors to this thread </li></ul></ul><ul><li>Several of our programs at UC <#7><subject> Davis have well-intentioned lower division research methods classes that introduce then never reinforce basic skills. </li></ul><ul><ul><li>Need to disambiguate “UC Davis” from someone called “Davis” </li></ul></ul><ul><li>Research (to paraphrase my hero, <#8><subject> Shrek) is like onions. Not because it stinks, but because it is made up of layers. </li></ul><ul><ul><li>“Shrek” as a name will not appear in conventional name lists. </li></ul></ul>
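
Hand-coded text in this format can be read back into (id, category, name) records with a small parser. The sketch below is only an illustration under stated assumptions: the function name `extract_coded_names` is hypothetical, the `#n` prefix is treated as an opaque numeric id (the slides do not say what it identifies), and the trailing digit in tags such as `<subject 2>` is taken to be the number of words in the name, as the category list describes.

```python
import re

# Matches e.g. "<#1><subject>", "<#1><subject 2>", "< #1><signature>".
TAG = re.compile(r"<\s*#(\d+)>\s*<([a-z\- ]+?)(?: (\d))?>\s*")

def extract_coded_names(text):
    """Return a list of (code_id, category, name) tuples.

    Assumption: a trailing digit in the category (e.g. 'subject 2')
    gives the number of words in the coded name; otherwise it is 1.
    """
    records = []
    for match in TAG.finditer(text):
        n_words = int(match.group(3)) if match.group(3) else 1
        # Take the next n_words whitespace-separated tokens as the name.
        words = text[match.end():].split()[:n_words]
        name = " ".join(words).strip(".,;:!?")
        records.append((int(match.group(1)), match.group(2), name))
    return records

sample = ("an article on <#1><subject> Brewster . "
          "<#1><subject 2> Brewster Kahle is touring the country; "
          "see the site <#1><reference> Jodie mentioned. "
          "< #1><signature> LA")
records = extract_coded_names(sample)
```

Grouping the resulting records by category gives the counts of addressees, references, subjects and signatures that the automated methods can then be compared against.
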
  16. Automated Node & Tie Discovery <ul><li>Method </li></ul><ul><ul><li>Determine names in the dataset, and assign a probability value to each </li></ul></ul><ul><ul><li>Determine the relationship between email addresses and names </li></ul></ul><ul><ul><li>Assign a tie weight to each discovered tie </li></ul></ul>
  17. Automated Node Discovery <ul><li>Named Entity Recognition </li></ul><ul><ul><li>Discovery of personal names </li></ul></ul><ul><ul><ul><li>The 1990 US Census http:// </li></ul></ul></ul><ul><ul><ul><li>Capitalization </li></ul></ul></ul><ul><li>Distinguishing between names of people in and outside the class </li></ul><ul><ul><li>Having a list of names doesn’t always work </li></ul></ul><ul><ul><ul><li>e.g., if someone uses a middle name that is not on the name list, or uses a short name or nickname </li></ul></ul></ul><ul><ul><li>Method: associate names with email addresses in the class </li></ul></ul><ul><ul><ul><li>relying on content-based (e.g. context words) and structure-based (e.g. word position) features of names </li></ul></ul></ul><ul><ul><li>Issues </li></ul></ul><ul><ul><ul><li>Many names, same person </li></ul></ul></ul><ul><ul><ul><li>Same name, many people </li></ul></ul></ul>
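
A first pass at personal-name discovery, combining a name list with a capitalization check, might look like the following. It is a rough sketch rather than the authors' procedure: `candidate_names` and the tiny `name_list` are hypothetical stand-ins for a real list such as the census names mentioned on the slide.

```python
import re

# A word: starts with a letter, may contain apostrophes or hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z'\-]*")

def candidate_names(text, name_list):
    """Return capitalized tokens that also appear in name_list."""
    known = {name.lower() for name in name_list}
    found = []
    for match in TOKEN.finditer(text):
        word = match.group()
        # Require both signals: capital initial and list membership.
        if word[0].isupper() and word.lower() in known:
            found.append(word)
    return found

# Hypothetical miniature name list, for illustration only.
name_list = ["dustin", "sam", "nick", "mark", "davis"]
text = ("Hi Dustin, Sam, Nick and all. I think mark up matters "
        "at UC Davis, as Shrek would say.")
found = candidate_names(text, name_list)
```

The example shows why this step is not sufficient on its own: "Davis" is returned although it is part of "UC Davis", and "Shrek" is missed because it is on no name list, which are the same problems the hand-coded examples point out.
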
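  18. Automated Node Discovery (2) <ul><li>EXAMPLE </li></ul><ul><li>From: [email_address] (= Wilma) </li></ul><ul><li>Reference Chain: [email_address] (= Dustin) => [email_address] (= Sam) </li></ul><ul><li>Hi Dustin, Sam, Nick and all, I appreciate your posts from this and last week […]. I keep thinking of poor Charlie who only wanted information on “dogs” Sam has been talking about. […] Wilma. </li></ul>
     <table>
     <tr><th>Words to the Left</th><th>Name</th><th>Words to the Right</th><th>Position %</th><th>Score “TO”</th><th>Score “FROM”</th></tr>
     <tr><td>* Hi</td><td>Dustin</td><td>Sam, Nick,</td><td>0</td><td>0.322</td><td>-0.004</td></tr>
     <tr><td>* Dustin,</td><td>Sam</td><td>Nick and</td><td>1</td><td>0.321</td><td>-0.002</td></tr>
     <tr><td>Dustin, Sam,</td><td>Nick</td><td>and all,</td><td>2</td><td>0.320</td><td>-0.001</td></tr>
     <tr><td>of poor</td><td>Charlie</td><td>who only</td><td>50</td><td>0.05</td><td>0.04</td></tr>
     <tr><td>on “dogs”</td><td>Sam</td><td>has been</td><td>65</td><td>0.285</td><td>0.07</td></tr>
     <tr><td>*</td><td>Wilma</td><td>*</td><td>88</td><td>0.0012</td><td>0.116</td></tr>
     </table>
     * = end of the line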
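
The context columns in this example (words to the left, words to the right, position in the message) are straightforward to extract once a name has been located. The sketch below shows one way to do so; `name_features`, the two-word window, and the definition of position as token index over token count are assumptions, and the "TO"/"FROM" scores themselves are not reproduced because the slides do not describe how they are computed.

```python
def name_features(tokens, index, window=2):
    """Describe the context of the name at tokens[index].

    Returns the words on either side (up to `window` each) and the
    position of the name as a whole-number percentage of the message.
    """
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    position_pct = 100 * index // len(tokens)
    return {
        "left": left,
        "name": tokens[index],
        "right": right,
        "position_pct": position_pct,
    }

tokens = ["Hi", "Dustin,", "Sam,", "Nick", "and", "all,",
          "I", "appreciate", "your", "posts"]
dustin = name_features(tokens, 1)
```

Names near the start of a message, like the greeting here, would be natural candidates for addressees, and names near the end for signatures, which is consistent with the pattern of scores in the example.
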
  19. Automated Tie Discovery <ul><li>Associate each sender in the class with all names mentioned in his/her emails. For example, </li></ul><ul><ul><li>Wilma ---> Dustin = [email_address] </li></ul></ul><ul><ul><li>Wilma ---> Charlie </li></ul></ul><ul><ul><ul><li>no email for Charlie, so not a person in the conversation group (e.g., when Steve and I took Professor Sid’s course last year) </li></ul></ul></ul><ul><ul><li>Wilma ---> </li></ul></ul><ul><ul><ul><li>no mention of a name; info on the tie is only in the Chain network; could be the start of a thread, a change of topic within a thread, or a general posting </li></ul></ul></ul><ul><li>Assign tie weight </li></ul><ul><ul><li>Pair counts </li></ul></ul><ul><ul><li>Mutual information </li></ul></ul>
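
Both weighting schemes can be computed from a flat list of (sender, named person) mentions. The sketch below is illustrative only: `tie_weights` is a hypothetical helper, and pointwise mutual information is used as one common reading of "mutual information", since the slides do not give the exact formula.

```python
import math
from collections import Counter

def tie_weights(mentions):
    """Compute pair counts and pointwise mutual information (PMI).

    `mentions` is a list of (sender, named_person) pairs, one entry
    per name mention found in a post.
    """
    pair_counts = Counter(mentions)
    sender_counts = Counter(sender for sender, _ in mentions)
    name_counts = Counter(name for _, name in mentions)
    total = len(mentions)

    pmi = {}
    for (sender, name), count in pair_counts.items():
        # PMI = log2( p(sender, name) / (p(sender) * p(name)) )
        pmi[(sender, name)] = math.log2(
            count * total / (sender_counts[sender] * name_counts[name])
        )
    return pair_counts, pmi

mentions = [("Wilma", "Dustin"), ("Wilma", "Dustin"),
            ("Wilma", "Sam"), ("Sam", "Dustin")]
pair_counts, pmi = tie_weights(mentions)
```

Pair counts reward frequent mentions directly, whereas PMI discounts names that everyone mentions often, so the two weightings can rank the same ties differently.
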
  20. Chain vs. Name Networks <ul><li>Get added information from the name network </li></ul><ul><ul><li>Ex. BBoards #06,07,08 </li></ul></ul><ul><ul><li>Nodes: 37 Messages: 346 </li></ul></ul><ul><ul><li>Chain network ties: 223 </li></ul></ul><ul><ul><li>Name network ties: 215 / 429 </li></ul></ul><ul><ul><li>Shared ties: 140 </li></ul></ul><ul><ul><li>QAP Pearson Correlation: 0.453 ( p = .000) </li></ul></ul>
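
A comparison of this kind can be sketched with plain Python: count the ties present in both networks, then correlate the two adjacency matrices and assess the correlation with a QAP-style permutation test, in which the rows and columns of one matrix are relabelled together. The helper names, the number of permutations, and the one-sided p-value are assumptions of this sketch; the figures on the slide come from the authors' own analysis and are not reproduced here.

```python
import random

def offdiag(matrix, order):
    """Off-diagonal cells of `matrix` after relabelling nodes by `order`."""
    n = len(matrix)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(n) if i != j]

def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def shared_ties(a, b):
    """Number of node pairs that are tied in both networks."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (a[i][j] or a[j][i]) and (b[i][j] or b[j][i]))

def qap_correlation(a, b, n_perm=1000, seed=0):
    """Observed correlation and one-sided permutation p-value."""
    identity = list(range(len(a)))
    x = offdiag(a, identity)
    observed = pearson(x, offdiag(b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)  # relabel rows and columns of b together
        if pearson(x, offdiag(b, order)) >= observed:
            hits += 1
    return observed, hits / n_perm

# Toy 4-node networks, for illustration only.
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
name = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
r, p = qap_correlation(chain, name, n_perm=200)
```

Permuting node labels rather than individual cells keeps each network's structure intact, which is why QAP is preferred over an ordinary significance test for correlated matrices.
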
  21. An ego network for Brent. Figure panels: Name Network, Chain Network
  22. An ego network for Tyler. Figure panels: Name Network, Chain Network <ul><li>kurt -> Kurt Cobain, lead singer of the rock band Nirvana </li></ul><ul><li>dewey -> John Dewey, philosopher & educator </li></ul><ul><li>santa_monica -> Santa Monica Public Library </li></ul><ul><li>mark -> markup language </li></ul>
  23. Conclusion <ul><li>Uses and benefits of content-based networks </li></ul><ul><ul><li>Discovery of social network behavior rather than posting behavior </li></ul></ul><ul><ul><li>Discovery of social interactions between group members that happened outside the group (e.g. a fishing trip) </li></ul></ul><ul><ul><li>Discovery of relations between group members and people outside the group (e.g. a shared friend from another department) </li></ul></ul><ul><ul><li>Expert/Co-discussant discovery </li></ul></ul><ul><ul><li>Study of perceived social networks without directly collecting survey data from participants (?) </li></ul></ul>
  24. References and Further Reading <ul><li>Related papers </li></ul><ul><ul><li>Haythornthwaite, C. & Gruzd, A. (2007). A noun phrase analysis tool for mining online community. In C. Steinfield, B.T. Pentland, M. Ackerman & N. Contractor (Eds.), Communities and Technologies 2007 (pp. 67-86). London: Springer. </li></ul></ul><ul><ul><li>Welser, H. T., Gleave, E., Fisher, D. & Smith, M. (2007). Visualizing the signatures of social roles in online discussion groups. Journal of Social Structure, 8(2). </li></ul></ul>