Content-Based Social Network Analysis of Online Communities
Anatoliy Gruzd and Caroline Haythornthwaite (U of Illinois)
Social Network/ing Symposium,
Toronto, 2007

Published in: Technology

  • Transcript

    • 1. Content-Based Social Network Analysis of Online Communities. Anatoliy Gruzd and Caroline Haythornthwaite, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Social Network/ing Symposium, Toronto, 2007
    • 2. The Problem
      • Online communities are creating a growing volume of texts contributed by a growing number of participants
        • 100 million posters in Usenet (Marc Smith, quoted in CNET, 2003)
        • 3.12 terabytes of data *daily* on Usenet (2007)
        • By 2010, more than 70% of digital content will be user-generated, and the majority of it will still be text-based (technology consultancy IDC)
      Growth of Usenet (Wikipedia, Oct. 2007)
    • 3. Making Sense of Community Action
      • How can we help make sense of community action and interaction based solely on textual interchanges?
        • How can we make the social structures evident for participants, and managers or teachers?
        • How can we make advances from linear streams of text to visualized patterns of interaction?
      • Growth of blog activity March 2003-2006
      • 175,000 new blogs a day (2006)
    • 4. Mapping Online Communities
      • Mappings and internal examinations tend to be based on one aspect of ties
        • Links between sites
        • Reports of friendship or work relations
        • FOAF declarations
      • With a concentration on quantity over content
      Flickrverse, Gustavog, 2006 http://www.flickr.com/photo_zoom.gne?id=9708628&context=set-222111&size=l Based on 50 connections between people.
    • 5. Mapping Online Communities (2)
      • Emerging mappings include attention to
        • Poster activity
        • Actor profiles as posters
        • Content of sites
          • e.g., words in common on different sites (Gloor & Zhao, 2006)
      Welser, Gleave, Fisher & Smith, 2007, in JOSS
    • 6. Extracting Network Information
      • Determine who is talking to whom
        • Applying social network analysis techniques
      • Determine what they are talking about
        • Applying natural language processing techniques
      • Merge these to produce network detection that better represents ongoing processes
    • 7. Our Goal
      • Use natural language processing (NLP) to
        • enhance the current techniques of building social networks
        • gain more information and insight about Nodes, Relations, and Ties
      • Current focus is on bulletin boards
        • Current example is an online learning environment
        • Procedures are being derived to use for groups with unknown membership
    • 8. Adding more with NLP
      • Revealing network information
        • 1. Node discovery
        • 2. Tie discovery
        • 3. Relation discovery
        • 4. Role & Group discovery
      • Network visibility rather than aggregate behavior
      • Important for revealing structures to
        • Participants, to understand the ‘lay of the (cyber)land’
        • Instructors (or managers), to oversee participation and intervene as necessary
    • 9. Adding relational information
      • Few approaches (yet) derive relations from content, which can reveal
        • Networks based on multiple relations
        • Change in discourse over time
        • Changes in associations among network members by relation and time
      • Few deal with the vagaries of CMC texts
        • Bulletin boards, chat
        • Incorrect spelling, partial sentences, inventive punctuation
        • Deriving who is talking to whom from content analysis
      • Or local language conventions
        • Acronyms, group naming conventions, group word use conventions, nicknames for people and processes
    • 10. Node and Ties
      • Focus today on nodes and tie discovery
      • Identifying who are the actors in the network
        • Identify nodes, i.e., people
        • Make the tie(s) between nodes
      • Two approaches
        • Chain Network, based on chain of posting
        • Name Network, based on names used in the text
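    • 11. Chain Network: definition options
      • Example chain labels: A, B, C, D
      • Option 1: Connect a sender to the last person in the post chain only (undirected).
        • Tie weights: last person 1, intermediate person 0, first person 0
      • Option 2: Connect a sender to the last and first (= thread starter) person in the chain, and assign equal weight values (e.g. 1) to both ties.
        • Tie weights: last person 1, intermediate person 0, first person 1
      • Option 3: Same as option 2, but the tie between a sender and the first person is half weight (e.g. 0.5).
        • Tie weights: last person 1, intermediate person 0, first person .5
      • Option 4: Connect a sender to all people in the reference chain with decreasing weights.
        • Tie weights: last person 1, intermediate person .5, first person .25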
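The four chain-network options on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `chain_ties`, the poster labels in the usage note, and the choice to halve the weight at each step back in option 4 (which matches the slide's 1, .5, .25 example) are assumptions.

```python
from collections import defaultdict

def chain_ties(chain, sender, option):
    """Return {poster: weight} ties from `sender` to earlier posters.

    `chain` lists the earlier posters in posting order: thread starter
    first, most recent poster last. (Hypothetical helper for illustration.)
    """
    if not chain:
        return {}
    first, last = chain[0], chain[-1]
    ties = defaultdict(float)
    if option == 1:        # last poster only
        ties[last] = 1.0
    elif option == 2:      # last and first poster, equal weights
        ties[first] = 1.0
        ties[last] = 1.0
    elif option == 3:      # as option 2, but first poster at half weight
        ties[first] = 0.5
        ties[last] = 1.0
    elif option == 4:      # everyone in the chain, weight halves per step back (assumed decay)
        for steps_back, poster in enumerate(reversed(chain)):
            ties[poster] = max(ties[poster], 0.5 ** steps_back)
    else:
        raise ValueError("option must be 1, 2, 3 or 4")
    ties.pop(sender, None)  # drop self-ties if the sender posted earlier in the chain
    return dict(ties)
```

For a thread started by "A", continued by "B" and "C", and then answered by "D", `chain_ties(["A", "B", "C"], "D", 4)` gives weights of 1, .5 and .25 to C, B and A respectively.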
    • 12. Chain Networks: missed info.
      • Three examples (labelled Ex.1, Ex.2, Ex.3 on the slide)
        • Previous post is by Gabriel; Sam replies: ‘Nick, Ann, Gina, Gabriel: I apologize for not backing this up with a good source, but I know from reading about this topic that libraries…’
        • Previous posts by Gabriel, Sam, Gina, and Eva, then: ‘Gina, I owe you a cookie. This is exactly what I wanted to know. I was already planning on taking 302 next semester, and now I have something to look forward to!’
        • Post by Fred: ‘I wonder if that could be why other libraries around the world have resisted changing – it's too much work, and as Dan pointed out, too expensive.’
    • 13. Name networks
      • Making use of node and tie information that is in the text of the postings
      • Issues
        • Disambiguating names/nicknames from text
        • Disambiguating names of participants from names of people being discussed (e.g., subject)
        • Detection of aliases for a given person and disambiguation of two or more users with the same name
    • 14. Hand coding: categories
      • Network Participants
        • <from> = person indicated in ‘from’ line of post heading ( NB only info that is system generated )
        • <addressee> = direct reference to other ('I agree with you Todd')
        • <reference> = indirect reference to other ('Todd has a good point')
        • <self-reference> = poster references themselves in some way (braindead library student, high school teacher, etc.)
        • <signature> = name as given by the message author on their post
      • Named non-participants
        • <subject>, <subject 2>, or <subject 3> = name is a subject of the discussion, either as one name (Dewey), 2 (Brewster Kahle) or 3 (Charles R. Darwin)
        • <non-group reference> = reference to a person who is not in the group, nor the subject – e.g., a former professor
      • Error
        • <error> = new name appears because of error (e.g., Lackie as a subject instead of Leckie; or part of a prevpost line does not conform to the usual format)
      • Previous Posts (if not removed from dataset)
        • <previous-poster> = when the previous message is included, this indicates the poster (‘Janice wrote: ’) (system generated)
        • <copy> = name appears because it is included with the previous message
    • 15. Examples of hand-coding
      • Just a note to clarify something in yesterday's lecture/chat session. I mentioned that Monday's NY Times had an article on <#1><subject> Brewster. I want to clarify that the article concerns the copyright extension law and the current Supreme Court case <#1><subject> Eldred v. <#1><subject> Ashcroft, set to begin today, I believe. <#1><subject 2> Brewster Kahle is currently touring the country in a bookmobile … For more info on this … you can refer to the Web site that <#1><reference> Jodie mentioned yesterday… <#1><signature> LA
        • NB. Jodie may not even appear in the contributors to this thread
      • Several of our programs at UC <#7><subject> Davis have well-intentioned lower division research methods classes that introduce then never reinforce basic skills.
        • Need to disambiguate “UC Davis” from someone called “Davis”
      • Research (to paraphrase my hero, <#8><subject> Shrek ) is like onions. Not because it stinks, but because it is made up of layers.
        • “Shrek” as a name will not appear in conventional name lists.
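Hand-coded text in the style shown on this slide can be read back into structured records with a small parser. The sketch below is an assumption about how such markup might be processed, not a tool described by the authors: the function name `parse_coded_names` is hypothetical, the meaning of the number after `#` is not stated on the slide and is simply passed through, and a trailing digit in a category (as in `<subject 2>`) is taken to be the number of words in the name.

```python
import re

# Matches tag pairs such as "<#1><subject>" or "<#1><subject 2>".
TAG = re.compile(r"<\s*#(\d+)>\s*<([a-z\- ]+?)(?:\s+(\d))?>\s*")

def parse_coded_names(text):
    """Return (code_number, category, name) for each hand-coded tag in `text`."""
    results = []
    for match in TAG.finditer(text):
        code_number = int(match.group(1))
        category = match.group(2)
        n_words = int(match.group(3) or 1)   # "<subject 2>" covers two words
        words = text[match.end():].split()[:n_words]
        name = " ".join(w.strip(".,;:!?()") for w in words)
        results.append((code_number, category, name))
    return results
```

Counting the resulting tuples by category gives a quick view of how many names in a message are participants (addressee, reference, signature) rather than subjects of discussion.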
    • 16. Automated Node & Tie Discovery
      • Method
        • Determine names in the dataset, and assign a probability value
        • Determine email address to name relationship
        • Assign tie weight to each discovered tie
    • 17. Automated Node Discovery
      • Named Entities Recognition
        • Discovery of personal names
          • The 1990 US Census http://www.census.gov/genealogy/names
          • Capitalization
      • Distinguishing between names of people in and outside the class
        • Having a list of names doesn’t always work
          • e.g., if someone uses their middle name which is not on the name list, or they use a short or nickname;
        • Method: associate names with email addresses in the class
          • relying on content-based (e.g. context words) and structure-based (e.g. word position) features of names
        • Issues
          • Many names - same person
          • Same name - many people
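The two cues named on this slide, a name list and capitalization, can be combined into a rough candidate-name detector. The following is a minimal sketch under stated assumptions: `KNOWN_FIRST_NAMES` is a tiny stand-in for a real list such as the census names file, the 0.5 weights are arbitrary, and the function name `candidate_names` is hypothetical. It does not resolve aliases or duplicate names.

```python
import re

# Tiny stand-in for a first-name list such as the 1990 US Census names file.
KNOWN_FIRST_NAMES = {"dustin", "sam", "nick", "wilma", "charlie", "davis"}

def candidate_names(text, known_names=KNOWN_FIRST_NAMES):
    """Return {token: score} for capitalized tokens that look like names."""
    candidates = {}
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for i, tok in enumerate(tokens):
            if len(tok) < 2 or not tok[0].isupper():
                continue                  # skip lowercase words and the pronoun "I"
            score = 0.0
            if tok.lower() in known_names:
                score += 0.5              # appears in the name list
            if i > 0:
                score += 0.5              # capitalized in mid-sentence
            if score > 0:
                candidates[tok] = max(score, candidates.get(tok, 0.0))
    return candidates
```

On "Wilma wrote about UC Davis." this scores "Davis" as highly as a real participant name, which is the kind of false positive that the context-based and structure-based features are meant to filter out.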
    • 18. Automated Node Discovery (2)
      • EXAMPLE
      • From: [email_address] (= Wilma)
      • Reference Chain: [email_address] (= Dustin) => [email_address] (= Sam)
      • Hi Dustin, Sam, Nick and all, I appreciate your posts from this and last week […]. I keep thinking of poor Charlie who only wanted information on “dogs” Sam has been talking about. […] Wilma.
      Words to the Left | Name    | Words to the Right | Position % | Score “TO” | Score “FROM”
      * Hi              | Dustin  | Sam, Nick,         | 0          | 0.322      | -0.004
      * Dustin,         | Sam     | Nick and           | 1          | 0.321      | -0.002
      Dustin, Sam,      | Nick    | and all,           | 2          | 0.320      | -0.001
      of poor           | Charlie | who only           | 50         | 0.05       | 0.04
      on “dogs”         | Sam     | has been           | 65         | 0.285      | 0.07
      *                 | Wilma   | *                  | 88         | 0.0012     | 0.116
      * - end of the line
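The context and position columns in this example can be derived mechanically from a tokenized message. The sketch below shows one way to do so; it is an assumption rather than the authors' procedure. The function name `name_contexts`, the two-word window, the use of "*" for an empty window, and the position formula (token index divided by message length) are all illustrative choices, and the “TO” and “FROM” scores are not computed here because the slide does not say how they are obtained.

```python
def name_contexts(tokens, names, window=2):
    """Return one feature row per occurrence of a known name in `tokens`."""
    rows = []
    total = len(tokens)
    for i, tok in enumerate(tokens):
        word = tok.strip(".,;:!?")
        if word not in names:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        rows.append({
            "name": word,
            "left": " ".join(left) if left else "*",     # "*" marks an empty window
            "right": " ".join(right) if right else "*",
            "position_pct": round(100 * i / total),      # assumed definition of position
        })
    return rows
```

Rows of this kind could then be fed to whatever scoring model decides whether a name is an addressee or the sender's own signature.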
    • 19. Automated Tie Discovery
      • Associate each sender in the class with all names mentioned in his/her emails. For example,
        • Wilma ---> Dustin = [email_address]
        • Wilma ---> Charlie
          • no email for Charlie, so not a person in the conversation group (e.g., when Steve and I took Professor Sid’s course last year)
        • Wilma --->
          • no mention of a name; info on tie is only in the Chain network; could be the start of a thread, a change of topic within a thread, or a general posting
      • Assign tie weight
        • Pair counts
        • Mutual information
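Both weighting schemes listed on this slide can be computed from a flat list of (sender, mentioned name) pairs. The sketch below is a hedged illustration: the slide only says "mutual information", and pointwise mutual information (PMI) is used here as one plausible reading. The function name `tie_weights` and the input format are assumptions.

```python
import math
from collections import Counter

def tie_weights(mentions):
    """Compute tie weights from a list of (sender, mentioned_name) pairs.

    Returns {(sender, name): (pair_count, pmi)}, where pmi is the pointwise
    mutual information in bits between the sender and the mentioned name.
    """
    pair_counts = Counter(mentions)
    sender_counts = Counter(sender for sender, _ in mentions)
    name_counts = Counter(name for _, name in mentions)
    total = len(mentions)
    weights = {}
    for (sender, name), count in pair_counts.items():
        p_pair = count / total
        p_sender = sender_counts[sender] / total
        p_name = name_counts[name] / total
        weights[(sender, name)] = (count, math.log2(p_pair / (p_sender * p_name)))
    return weights
```

Pair counts favor prolific posters, while PMI discounts names that everyone mentions, so the two weights can rank the same tie quite differently.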
    • 20. Chain vs. Name Networks
      • Get added information from the name network
        • Ex. BBoards #06,07,08
        • Nodes: 37 Messages: 346
        • Chain network ties: 223
        • Name network ties: 215 / 429
        • Shared ties: 140
        • QAP Pearson Correlation: 0.453 ( p = .000)
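The comparison reported on this slide, shared ties plus a QAP Pearson correlation, can be approximated with the standard library alone. The sketch below is an illustration and not the software used for the reported figures: function names, the number of permutations, and the one-sided p-value are assumptions. It expects square adjacency matrices with nodes in the same order and at least some variation in tie values.

```python
import random

def offdiag(matrix, order):
    """Off-diagonal cells of `matrix` with rows and columns relabeled by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(n) if i != j]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def shared_ties(a, b):
    """Count directed ties present in both matrices."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(n) if i != j and a[i][j] and b[i][j])

def qap_correlation(a, b, permutations=1000, seed=0):
    """Observed Pearson correlation between `a` and `b`, and the share of
    random node relabelings of `b` that correlate at least as strongly."""
    identity = list(range(len(a)))
    cells_a = offdiag(a, identity)
    observed = pearson(cells_a, offdiag(b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)                 # relabel rows and columns together
        if pearson(cells_a, offdiag(b, perm)) >= observed:
            hits += 1
    return observed, hits / permutations
```

Permuting rows and columns together keeps each network's structure intact while breaking the node correspondence, which is why QAP is preferred over an ordinary significance test on the cell values.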
    • 21. An ego network for Brent
      • [Figure: two panels, Name Network and Chain Network. Visualization powered by http://www.netvis.org]
    • 22. An ego network for Tyler
      • [Figure: two panels, Name Network and Chain Network. Visualization powered by http://www.netvis.org]
      • Annotations
        • kurt -> Kurt Cobain, a lead singer for the rock band Nirvana
        • dewey -> John Dewey, philosopher & educator
        • santa_monica -> Santa Monica Public Library
        • mark -> mark up language
    • 23. Conclusion
      • Uses and benefits of content-based networks
        • Discovery of social network behavior rather than posting behavior
        • Discovery of social interactions between group members that happened outside the group (e.g. fishing trip)
        • Discovery of relations between group members and people outside the group (e.g. a shared friend from another department)
        • Expert/Co-discussant discovery
        • Study of perceived social networks without directly collecting survey-data from participants (?)
    • 24. References and Further Reading
      • Related papers
        • Haythornthwaite, C. & Gruzd, A. (2007). A noun phrase analysis tool for mining online community. In C. Steinfield, B.T. Pentland, M. Ackerman & N. Contractor (eds.). Communities and Technologies 2007 (pp. 67-86). London: Springer.
        • Welser, H.T., Gleave, E., Fisher, D. & Smith, M. (2007). Visualizing the signatures of social roles in online discussion groups. Journal of Social Structure, 8(2). http://www.cmu.edu/joss/content/articles/volume8/Welser/
