Content-Based Social Network Analysis of Online Communities
Anatoliy Gruzd and Caroline Haythornthwaite (U of Illinois)
Social Network/ing Symposium,
Toronto, 2007

Published in: Technology

  • Transcript

    • 1. Content-Based Social Network Analysis of Online Communities. Anatoliy Gruzd and Caroline Haythornthwaite, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Social Network/ing Symposium, Toronto, 2007
    • 2. The Problem
      • Online communities are creating a growing volume of texts contributed by a growing number of participants
        • 100 million posters in Usenet (Marc Smith, quoted in CNET, 2003)
        • 3.12 terabytes of data daily on Usenet (2007)
        • By 2010, more than 70% of digital content will be user-generated, and the majority of it will still be text-based (technology consultancy IDC)
      Figure: Growth of Usenet (Wikipedia, Oct. 2007)
    • 3. Making Sense of Community Action
      • How can we help make sense of community action and interaction based solely on textual interchanges?
        • How can we make the social structures evident for participants, and managers or teachers?
        • How can we make advances from linear streams of text to visualized patterns of interaction?
      • Growth of blog activity March 2003-2006
      • 175,000 new blogs a day (2006)
    • 4. Mapping Online Communities
      • Mappings and internal examinations tend to be based on one aspect of ties
        • Links between sites
        • Reports of friendship or work relations
        • FOAF declarations
      • With a concentration on quantity over content
      Flickrverse, Gustavog, 2006 Based on 50 connections between people.
    • 5. Mapping Online Communities (2)
      • Emerging mappings include attention to
        • Poster activity
        • Actor profiles as posters
        • Content of sites
          • e.g., words in common on different sites (Gloor & Zhao, 2006)
      Welser, Gleave, Fisher & Smith, 2007 in JOSS
    • 6. Extracting Network Information
      • Determine who is talking to whom
        • Applying social network analysis techniques
      • Determine what they are talking about
        • Applying natural language processing techniques
      • Merge these to produce network detection that better represents ongoing processes
    • 7. Our Goal
      • Use natural language processing (NLP) to
        • enhance the current techniques of building social networks
        • gain more information and insight about Nodes, Relations, and Ties
      • Current focus is on bulletin boards
        • Current example is an online learning environment
        • Procedures are being derived for use with groups of unknown membership
    • 8. Adding more with NLP
      • Revealing network information
        • 1. Node discovery
        • 2. Tie discovery
        • 3. Relation discovery
        • 4. Role & Group discovery
      • Network visibility rather than aggregate behavior
      • Important for revealing structures to
        • Participants, to understand the ‘lay of the (cyber)land’
        • Instructors (or managers), to oversee participation and intervene as necessary
    • 9. Adding relational information
      • Few approaches (yet) derive relations from content, which can reveal
        • Networks based on multiple relations
        • Change in discourse over time
        • Changes in associations among network members by relation and time
      • Few deal with the vagaries of computer-mediated communication (CMC) texts
        • Bulletin boards, chat
        • Incorrect spelling, partial sentences, inventive punctuation
        • Deriving who is talking to whom from content analysis
      • Or local language conventions
        • Acronyms, group naming conventions, group word use conventions, nicknames for people and processes
    • 10. Nodes and Ties
      • Focus today on node and tie discovery
      • Identifying who are the actors in the network
        • Identify nodes, i.e., people
        • Make the tie(s) between nodes
      • Two approaches
        • Chain Network, based on chain of posting
        • Name Network, based on names used in the text
    • 11. Chain Network: definition options
      • Example reference chain of four posters: A, B, C, D
      • Option 1: Connect a sender to the last person in the post chain only (undirected). Tie weights (last person / intermediate person / first person): 1 / 0 / 0
      • Option 2: Connect a sender to the last and first (= thread starter) person in the chain, and assign equal weight values (e.g. 1) to both ties. Tie weights: 1 / 0 / 1
      • Option 3: Same as option 2, but the tie between a sender and the first person is half weight (e.g. 0.5). Tie weights: 1 / 0 / 0.5
      • Option 4: Connect a sender to all people in the reference chain with decreasing weights. Tie weights: 1 / 0.5 / 0.25
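The four chain-network definition options can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name `chain_ties`, the convention that the chain is listed from thread starter to most recent poster, and the rule that option 4 halves the weight at each step back (suggested by the values 1, .5, .25 on the slide) are all assumptions.

```python
def chain_ties(sender, chain, option):
    """Return {(sender, other): weight} for a single post.

    `chain` lists the earlier posters from thread starter (first) to the
    most recent poster (last). Both the ordering and the option numbers
    are assumptions made for this sketch.
    """
    if not chain:
        return {}
    first, last = chain[0], chain[-1]
    if option == 1:      # last person only
        candidates = [(last, 1.0)]
    elif option == 2:    # last and first person, equal weights
        candidates = [(last, 1.0), (first, 1.0)]
    elif option == 3:    # as option 2, but the first person gets half weight
        candidates = [(last, 1.0), (first, 0.5)]
    elif option == 4:    # everyone in the chain, weight halves per step back
        candidates = [(p, 0.5 ** k) for k, p in enumerate(reversed(chain))]
    else:
        raise ValueError("option must be 1, 2, 3 or 4")

    ties = {}
    for person, weight in candidates:
        if person == sender:          # skip self-ties
            continue
        key = (sender, person)
        ties[key] = max(ties.get(key, 0.0), weight)  # keep the strongest weight
    return ties


# D replies in a thread started by A, then continued by B and C.
print(chain_ties("D", ["A", "B", "C"], option=4))
```

Summing these per-post dictionaries over all posts in a bulletin board would give a weighted chain network under the chosen option.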
    • 12. Chain Networks: missed info.
      • Three examples (Ex. 1 to Ex. 3):
        • Previous post is by Gabriel, Sam replies: ‘Nick, Ann, Gina, Gabriel: I apologize for not backing this up with a good source, but I know from reading about this topic that libraries…’
        • Previous posts by Gabriel, Sam, Gina, and Eva, then: ‘Gina, I owe you a cookie. This is exactly what I wanted to know. I was already planning on taking 302 next semester, and now I have something to look forward to!’
        • Post by Fred: ‘I wonder if that could be why other libraries around the world have resisted changing – it's too much work, and as Dan pointed out, too expensive.’
    • 13. Name networks
      • Making use of node and tie information that is in the text of the postings
      • Issues
        • Disambiguating names/nicknames from text
        • Disambiguating names of participants from names of people being discussed (e.g., subject)
        • Detection of aliases for a given person and disambiguation of two or more users with the same name
    • 14. Hand coding: categories
      • Network Participants
        • <from> = person indicated in ‘from’ line of post heading ( NB only info that is system generated )
        • <addressee> = direct reference to other ('I agree with you Todd')
        • <reference> = indirect reference to other ('Todd has a good point')
        • <self-reference> = poster references themselves in some way (braindead library student, high school teacher, etc.)
        • <signature> = name as given by the message author on their post
      • Named non-participants
        • <subject>, <subject 2>, or <subject 3> = name is a subject of the discussion, either as one name (Dewey), 2 (Brewster Kahle) or 3 (Charles R. Darwin)
        • <non-group reference> = reference to a person who is not in the group, nor the subject – e.g., a former professor
      • Error
        • <error> = new name appears because of error (e.g., Lackie as a subject instead of Leckie; or part of a prevpost line does not conform to the usual format)
      • Previous Posts (if not removed from dataset)
        • <previous-poster> = when the previous message is included, this indicates the poster (‘Janice wrote: ’) (system generated)
        • <copy> = name appears because it is included with the previous message
    • 15. Examples of hand-coding
      • Just a note to clarify something in yesterday's lecture/chat session. I mentioned that Monday's NY Times had an article on <#1><subject> Brewster. I want to clarify that the article concerns the copyright extension law and the current Supreme Court case <#1><subject> Eldred v. <#1><subject> Ashcroft, set to begin today, I believe. <#1><subject 2> Brewster Kahle is currently touring the country in a bookmobile … For more info on this … you can refer to the Web site that <#1><reference> Jodie mentioned yesterday… <#1><signature> LA
        • NB. Jodie may not even appear in the contributors to this thread
      • Several of our programs at UC <#7><subject> Davis have well-intentioned lower division research methods classes that introduce then never reinforce basic skills.
        • Need to disambiguate “UC Davis” from someone called “Davis”
      • Research (to paraphrase my hero, <#8><subject> Shrek ) is like onions. Not because it stinks, but because it is made up of layers.
        • “Shrek” as a name will not appear in conventional name lists.
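Hand-coded posts in this `<#N><category> Name` notation can be read back into records with a small parser. The sketch below is illustrative only: the function name `parse_coded_names` is hypothetical, the number after `#` is carried along without interpreting it, and a digit inside the tag (as in `<subject 2>`) is assumed to give the number of words in the name.

```python
import re

# Matches tags such as "<#1><subject>", "<#1><subject 2>" or "< #1><signature>".
TAG = re.compile(r"<\s*#(\d+)>\s*<([a-z][a-z\- ]*?)(?: (\d))?>")


def parse_coded_names(text):
    """Return a list of (number, category, name) tuples found in `text`."""
    records = []
    for match in TAG.finditer(text):
        n_words = int(match.group(3)) if match.group(3) else 1
        # Take the next n_words whitespace-separated tokens as the name.
        tokens = text[match.end():].split(None, n_words)[:n_words]
        name = " ".join(tokens).strip(".,;:!?")
        records.append((int(match.group(1)), match.group(2), name))
    return records


sample = (
    "an article on <#1><subject> Brewster . <#1><subject 2> Brewster Kahle "
    "is touring. See the site that <#1><reference> Jodie mentioned. "
    "< #1><signature> LA"
)
for record in parse_coded_names(sample):
    print(record)
```

Records with the categories `addressee` and `reference` are the ones that would feed tie discovery; `subject` and `non-group reference` records would be set aside.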
    • 16. Automated Node & Tie Discovery
      • Method
        • Determine names in the dataset, and assign a probability value
        • Determine email address to name relationship
        • Assign tie weight to each discovered tie
    • 17. Automated Node Discovery
      • Named Entity Recognition
        • Discovery of personal names
          • The 1990 US Census http://
          • Capitalization
      • Distinguishing between names of people in and outside the class
        • Having a list of names doesn’t always work
          • e.g., if someone uses their middle name which is not on the name list, or they use a short or nickname;
        • Method: associate names with email addresses in the class
          • relying on content-based (e.g. context words) and structure-based (e.g. word position) features of names
        • Issues
          • Many names - same person
          • Same name - many people
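A first pass at personal-name discovery with a name list plus capitalization might look like the sketch below. The tiny `KNOWN_NAMES` set is a made-up stand-in for a real name list such as the census list mentioned above, and `candidate_names` is a hypothetical helper, not the tool used in the study.

```python
import re

# Stand-in for a large first-name/surname list (assumption for this sketch).
KNOWN_NAMES = {"dustin", "sam", "nick", "charlie", "wilma", "davis"}

# A capitalized word: one uppercase letter followed by lowercase letters.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def candidate_names(text, known_names=KNOWN_NAMES):
    """Return (word, character offset) for capitalized words on the name list."""
    return [
        (m.group(), m.start())
        for m in CAPITALIZED.finditer(text)
        if m.group().lower() in known_names
    ]


print(candidate_names("Hi Dustin, Sam, Nick and all"))
# "Davis" is flagged here even though it is part of "UC Davis", which is
# why a later disambiguation step is needed.
print(candidate_names("Several of our programs at UC Davis"))
```

A list lookup like this misses nicknames and names absent from the list (such as "Shrek"), which is the limitation the slide points out.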
    • 18. Automated Node Discovery (2)
      • EXAMPLE
      • From: [email_address] (= Wilma)
      • Reference Chain: [email_address] (= Dustin) => [email_address] (= Sam)
      • Hi Dustin, Sam, Nick and all, I appreciate your posts from this and last week […]. I keep thinking of poor Charlie who only wanted information on “dogs” Sam has been talking about. […] Wilma.
      Words to the Left | Name    | Words to the Right | Position % | Score “TO” | Score “FROM”
      * Hi              | Dustin  | Sam, Nick,         | 0          | 0.322      | -0.004
      * Dustin,         | Sam     | Nick and           | 1          | 0.321      | -0.002
      Dustin, Sam,      | Nick    | and all,           | 2          | 0.320      | -0.001
      of poor           | Charlie | who only           | 50         | 0.05       | 0.04
      on “dogs”         | Sam     | has been           | 65         | 0.285      | 0.07
      *                 | Wilma   | *                  | 88         | 0.0012     | 0.116
      * = end of the line
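The context and position features in this example (words to the left, words to the right, position in the message) can be extracted roughly as follows. This is a sketch under assumptions: `name_features` is a hypothetical helper, the window is two words, `*` pads a missing neighbour, and position is computed as the token index divided by the token count, which may differ from the formula behind the slide's figures. The “TO” and “FROM” scoring step is not described in the slides and is not reproduced here.

```python
def name_features(text, names, window=2):
    """Return one feature row per occurrence of a known name in `text`."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        word = token.strip(".,;:!?\"'")
        if word not in names:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # Pad with "*" where the message starts or ends inside the window.
        left = ["*"] * (window - len(left)) + left
        right = right + ["*"] * (window - len(right))
        rows.append({
            "name": word,
            "left": left,
            "right": right,
            "position": round(100 * i / len(tokens)),  # percent of message
        })
    return rows


message = "Hi Dustin, Sam, Nick and all, I keep thinking of poor Charlie. Wilma"
for row in name_features(message, {"Dustin", "Sam", "Nick", "Charlie", "Wilma"}):
    print(row)
```

Names near the start of a message would then tend to be treated as addressees and names at the very end as signatures, which matches the pattern of scores in the example.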
    • 19. Automated Tie Discovery
      • Associate each sender in the class with all names mentioned in his/her emails. For example,
        • Wilma ---> Dustin = [email_address]
        • Wilma ---> Charlie
          • no email for Charlie, so not a person in the conversation group (e.g., when Steve and I took Professor Sid’s course last year)
        • Wilma --->
          • no mention of a name; info on tie is only in the Chain network; could be the start of a thread, a change of topic within a thread, or a general posting
      • Assign tie weight
        • Pair counts
        • Mutual information
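Both weighting schemes can be computed from a list of (sender, mentioned name) observations. The sketch below is an assumption-laden illustration: `tie_weights` is a hypothetical function, and "mutual information" is interpreted here as pointwise mutual information over sender and name frequencies, which the slides do not spell out.

```python
import math
from collections import Counter


def tie_weights(mentions):
    """Return (pair_counts, pmi) for a list of (sender, name) pairs.

    pair_counts[(s, n)] is how often sender s mentioned name n.
    pmi[(s, n)] is log2( P(s, n) / (P(s) * P(n)) ).
    """
    pair_counts = Counter(mentions)
    sender_counts = Counter(sender for sender, _ in mentions)
    name_counts = Counter(name for _, name in mentions)
    total = len(mentions)
    pmi = {
        (s, n): math.log2(count * total / (sender_counts[s] * name_counts[n]))
        for (s, n), count in pair_counts.items()
    }
    return pair_counts, pmi


mentions = [
    ("Wilma", "Dustin"),
    ("Wilma", "Dustin"),
    ("Wilma", "Sam"),
    ("Sam", "Dustin"),
]
counts, pmi = tie_weights(mentions)
print(counts)
print(pmi)
```

Pair counts favor frequent posters, whereas a mutual-information weight discounts names that everyone mentions often, so the two can rank the same ties differently.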
    • 20. Chain vs. Name Networks
      • Get added information from the name network
        • Ex. BBoards #06,07,08
        • Nodes: 37 Messages: 346
        • Chain network ties: 223
        • Name network ties: 215 / 429
        • Shared ties: 140
        • QAP Pearson Correlation: 0.453 ( p = .000)
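A comparison of this kind, counting shared ties and correlating two networks with a QAP permutation test, can be sketched in plain Python. The function names are hypothetical, both networks are assumed to be square adjacency matrices over the same node order, ties are treated as undirected when counting overlap, and the matrices must not be constant off the diagonal. It is not the software used to produce the figures above.

```python
import random
from statistics import mean


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def off_diagonal(matrix, order):
    """Off-diagonal cells after relabeling rows and columns by `order`."""
    return [matrix[i][j] for i in order for j in order if i != j]


def shared_ties(a, b):
    """Count unordered node pairs that are tied in both networks."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i][j] and b[i][j])


def qap_correlation(a, b, permutations=1000, seed=0):
    """Return (observed Pearson r, one-sided permutation p-value)."""
    identity = list(range(len(a)))
    xs = off_diagonal(a, identity)
    observed = pearson(xs, off_diagonal(b, identity))
    rng = random.Random(seed)
    at_least = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the nodes of b, keeping its structure
        if pearson(xs, off_diagonal(b, perm)) >= observed:
            at_least += 1
    return observed, at_least / permutations


chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
name = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
print(shared_ties(chain, name))
print(qap_correlation(chain, name, permutations=200))
```

Permuting rows and columns together preserves each network's structure while breaking the node correspondence, which is why the test is used instead of an ordinary significance test on dependent dyads.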
    • 21. An ego network for Brent
      Figure panels: Name Network, Chain Network
    • 22. An ego network for Tyler
      Figure panels: Name Network, Chain Network
      • kurt -> Kurt Cobain, lead singer of the rock band Nirvana
      • dewey -> John Dewey, philosopher & educator
      • santa_monica -> Santa Monica Public Library
      • mark -> markup language
    • 23. Conclusion
      • Uses and benefits of content-based networks
        • Discovery of social network behavior rather than posting behavior
        • Discovery of social interactions between group members that happened outside the group (e.g. fishing trip)
        • Discovery of relations between group members and people outside the group (e.g. a shared friend from another department)
        • Expert/Co-discussant discovery
        • Study of perceived social networks without directly collecting survey-data from participants (?)
    • 24. References and Further Reading
      • Related papers
        • Haythornthwaite, C. & Gruzd, A. (2007). A noun phrase analysis tool for mining online community. In C. Steinfield, B.T. Pentland, M. Ackerman & N. Contractor (eds.). Communities and Technologies 2007 (pp. 67-86). London: Springer.
        • Welser, H.T., Gleave, E., Fisher, D. & Smith, M. (2007). Visualizing the signatures of social roles in online discussion groups. Journal of Social Structure, 8(2).