Tools and Tips for Analyzing Social Media Data

Analyzing social media may be a daunting task, given its overwhelming size and messy, unstructured nature. Further, for those new to analyzing social behavior in online systems, there are any number of pitfalls that make it challenging to find the meaning in the mess. The goal of this session is to provide practical tips for collecting and analyzing social media data.

  • Which brings us to so.cl.
    In the past year at FUSE Labs we've been working on an experimental application called so.cl to explore some of these issues:
    how might we combine the capacity for searching the internet with really lightweight sharing, in the context of a social network,
    to enable informal learning?
  • Through so.cl, our goal is to help users discover new interests, connect with others around common interests, and be inspired to learn more
    For example, on the right, you can see I have clicked on the tag electronic arts, an interest of mine,
    and found this post from Nathan about a water clock – which is inspiring.
  • To give you an idea of the experience, here is what I see when I log in.
    The core of the experience is the activity feed in the middle,
    where I see posts made by everyone in so.cl, or just by the people I follow,
    or posts about things I find interesting.
    These posts are made by searching the internet and selecting the images and web sites you want to share,
    built in a very lightweight way from my searching.
  • Big picture:
    Learning about the real world through social media
    Social media: largest fine-grained record of human activity ever
  • Mention high-level analyses here

    1. Analyzing Social Media Systems. CHI Course 2013. Shelly Farnham, Emre Kiciman. FUSE Labs & Internet Services Research Center, Microsoft Research
    2. Agenda
       • Introductions
       • Overview
       • Lesson scenarios with real data
         • Usage analysis: predictors of coming back
         • Social network analysis: finding who you like
         • Content analysis: relationships, cliques, and their conversation
       • Focus on tools/tips and special considerations when examining social data
    3. MAKING MEANING OUT OF THE MESS
    4. SHELLY FARNHAM: INDUSTRY RESEARCH
       • Specializes in social technologies: social networks, community, identity, mobile social
       • Early-stage innovation with an extremely rapid R&D cycle: study, brainstorm, design, prototype, deploy, evaluate (repeat)
       • Convergent evaluation methodologies: usage analysis, interviews, questionnaires
       • Career: PhD in Social Psychology from UW; 7 years at Microsoft Research (Virtual Worlds, Social Computing, Community Technologies); 4 years in the startup world (Waggle Labs consulting, Pathable); 2 years at Yahoo!; now FUSE Labs, Microsoft Research (Personal Map)
    5. EMRE KICIMAN
       • Specializes in social data analytics: social media, social networks, search
       • Methods: machine learning; information extraction and entity recognition from social data; prototyping
       • Career: Ph.D. and M.S. in computer science from Stanford University; B.S. in Electrical Engineering and Computer Science; currently at the Internet Services Research Center, Microsoft Research
    6. ANALYSIS THROUGHOUT R&D CYCLE [Chart: importance of information in selecting a chat partner, ranking rating similarity, interaction with friends, and ratings by friends]
    7. USER STUDIES
    8. PROTOTYPING
    9. USAGE ANALYSIS: Do social responses matter in driving engagement?
    10. SOCIAL MEDIA ANALYSIS
       • Common types
         • Usage analysis: behaviors, interactions
         • Network analysis: patterns in networks (sets of pair-wise connections)
         • Content analysis: semantics and sentiment of conversational content
       • Common steps
         • Step 1. Getting started: defining questions
         • Step 2. Processing data: extraction, cleaning, summarization
         • Step 3. High-level analysis: inference
    11. CASE STUDY: USAGE ANALYSIS
       • So.cl usage analysis serves as the case study scenario; the lessons learned apply to other forms of social media and other forms of analysis
       • So.cl is an experimental web site that allows people to connect around their interests by integrating search tools with social networking
       • How important are social interactions in encouraging users to become engaged with an interest network?
    12. SO.CL: reimagining search as social from the ground up
       • search + sharing + networking = informal discovery and learning
       • History: Oct 2011: pre-release deployment study; Dec 2011: private, invitation-only beta; May 2012: removed invitation restrictions; Nov 2012: over 300K registered users, 13K active per month
       • Try it now! http://www.so.cl
    13. INTEREST NETWORK GOALS
       • Find others around common interests
       • Be inspired by new interests
       • Learn from each other through these shared interests
    14. HOW IT WORKS [Screenshot: search & post, feed filters, feed, people] Try it now! http://www.so.cl – use the facsumm tag
    15. POST BUILDING [Screenshot: search (Bing), results, filter results, post builder] Experience: Step 1: perform a search. Step 2: click on items in the results to add them to a post. Step 3: add a message. Step 4: tag. Try it now! http://www.so.cl – use the facsumm tag
    16. USAGE ANALYSIS
    17. STEP 1: GETTING STARTED
    18. DEFINING THE RESEARCH QUESTION
       • The amount of data is overwhelming – the more defined your question, the easier the analysis
       • What real-world problem are you trying to explore? Avoid the pitfall of technology for technology's sake
       • What argument do you want to be able to make?
       • State your problem as a hypothesis
    19. CASE SCENARIO
       • Real-world problem: help people learn online
       • Argument we want to make: people are more motivated to explore new interests via social media than via search alone, because of the opportunity to connect with others
       • Hypothesis: if people receive a social response when they first join So.cl, they are more likely to become engaged
    20. OPERATIONALIZING CONSTRUCTS
       • Operationalize = to make measurable
       • Always review the related literature for best practices
       • How do you measure… friendship? Similarity? Interest? Trend? Conversation? Community? Engagement?
       • Can you operationalize with existing data, or do you need to generate more?
    21. CASE SCENARIO
       • Hypothesis: if people receive a social response when they first join So.cl, they are more likely to become engaged
       • Measuring social/behavioral constructs:
         • When first joined: first session = time of first action to time of last action prior to an hour of inactivity
         • Social responses: follows user, likes user's post(s), comments on user's post(s)
         • Engagement = coming back: a second session = any action that occurs 60 minutes or more after the first session
       • Restated hypothesis: if people receive follows, likes, and comments in their first session, they are more likely to come back for a second session
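The session definitions above (a session ends after 60 minutes of inactivity; any later action opens a second session) can be sketched in a few lines. The cutoff matches the slides; the timestamps below are a hypothetical single-user log.

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, gap_minutes=60):
    """Group a user's action timestamps into sessions.

    A new session starts whenever the gap since the previous
    action is gap_minutes or more (60, per the slides).
    """
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] >= timedelta(minutes=gap_minutes):
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Hypothetical action log for one user:
t0 = datetime(2012, 10, 1, 9, 0)
actions = [t0, t0 + timedelta(minutes=5),   # first session: two actions
           t0 + timedelta(hours=35)]        # came back ~35 hours later
sessions = split_sessions(actions)
print(len(sessions))  # 2 -> the user "came back" for a second session
```

A user with two or more sessions counts as engaged under the restated hypothesis.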
    22. STEP 2. PROCESSING DATA
    23. COLLECTING DATA
       • Existing tools: APIs (Twitter, Foursquare, Yelp); web analytics (Google Analytics)
       • Write crawlers
       • Write your own instrumentation system, e.g. log each call to the server with its query parameters
    24. RAW INSTRUMENTATION
       • The tendency to collect everything leads to an incomprehensible, incoherent mess
       • Prone to bugs
    25. INSTRUMENTATION
       • Convert to human-readable form
    26. Always look at your raw data: play with it, ask yourself if it makes sense, test!
    27. COMMON INSTRUMENTATION SCHEMA
       • Users table: one row per user
    28. COMMON INSTRUMENTATION SCHEMA
       • Actions table: one row per meaningful action
       • Filter out non-meaningful, non-user-generated actions
    29. COMMON INSTRUMENTATION SCHEMA
       • Content table(s): one row per content item, with the text, URL, etc. of that item (e.g. messages, pictures shared, likes, tags)
    30. COMMON INSTRUMENTATION SCHEMA
       • Across tables, with social systems: instrument the social target (PersonA responds to PersonB) and the parent item (e.g., Comments A, B, and C are responses to parent item PostB)
       • In other words, instrument who is interacting with whom, and in what context
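A minimal sketch of the users / actions / content schema described in these slides, using SQLite. Table and column names here are illustrative, not So.cl's actual schema; the key point is that both the actor and the social target of an action are logged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    joined_at TEXT
);
CREATE TABLE actions (
    action_id   INTEGER PRIMARY KEY,
    user_id     INTEGER REFERENCES users(user_id),  -- who acted
    action_type TEXT,                               -- post, like, comment, follow
    target_user INTEGER REFERENCES users(user_id),  -- whom it was directed at
    parent_item INTEGER,                            -- e.g. the post a comment belongs to
    occurred_at TEXT
);
CREATE TABLE content (
    item_id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    text    TEXT,
    url     TEXT
);
""")
# "PersonA responds to PersonB": the actor AND the social target are both rows' columns.
conn.execute("INSERT INTO users VALUES (1, '2012-09-28'), (2, '2012-09-29')")
conn.execute("INSERT INTO actions VALUES (1, 2, 'comment', 1, 42, '2012-09-29')")
row = conn.execute("SELECT user_id, target_user FROM actions").fetchone()
print(row)  # (2, 1): user 2 responded to user 1
```

With the target and parent item instrumented, "did this new user receive a response?" becomes a simple join rather than a log-parsing exercise.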
    31. REDUCING LARGE DATA
       • Filters: time span, type of person, type of actions
       • Sampling: random selection; snowballing, to get a complete picture of a person's social experience
       • Consider your research questions and how you want to generalize
    32. FILTERING & SAMPLING
       • Filtered out administrators/community managers
       • New users only
       • Date range: Sept 28 to Oct 13
       • 100% sample for that time span: 2,462 people
    33. SYSTEMATIC BIASES IN SOCIAL SYSTEMS
       • If you want to understand your "typical" users, keep in mind that generally a large percent never become active or return – "lookiloos" can unduly bias averages
       • Common reporting format: X% performed behavior Y; of those, they averaged Z times each (e.g., 5% commented on a post in their first session, averaging 5 times each)
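The reporting format above (X% performed the behavior; of those, Z times each on average) avoids letting the never-active majority drag the mean toward zero. A small sketch, with hypothetical per-user counts:

```python
def report(users_actions, behavior):
    """users_actions: {user_id: {behavior: count}}.
    Returns (% of users who did the behavior,
             mean count among only those who did it)."""
    did = [acts.get(behavior, 0) for acts in users_actions.values()]
    doers = [c for c in did if c > 0]
    pct = 100.0 * len(doers) / len(did)
    avg = sum(doers) / len(doers) if doers else 0.0
    return pct, avg

# Hypothetical counts for 4 users; two are "lookiloos" who did nothing.
logs = {1: {"comment": 5}, 2: {}, 3: {"comment": 3}, 4: {}}
pct, avg = report(logs, "comment")
print(f"{pct:.0f}% commented, averaging {avg:.1f} times each")
# 50% commented, averaging 4.0 times each
```

A naive overall mean here would be 2.0 comments per user, which describes nobody; the two-number format keeps both facts visible.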
    34. OUTLIERS
       • Filtered out 13 outliers with z > 4 in number of actions (among users who did more than sign in)
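The z > 4 rule from this slide can be sketched with the standard-library statistics module; the counts below are made up, with one hyper-active user standing in for the spammers/trolls discussed next.

```python
from statistics import mean, pstdev

def drop_outliers(counts, z_cutoff=4.0):
    """Drop users whose action count is more than z_cutoff
    standard deviations above the mean (the z > 4 rule)."""
    mu = mean(counts.values())
    sigma = pstdev(counts.values())
    return {u: c for u, c in counts.items()
            if sigma == 0 or (c - mu) / sigma <= z_cutoff}

# Hypothetical action counts; user 99 is a hyper-active outlier.
counts = {u: 10 for u in range(1, 50)}
counts[99] = 10_000
kept = drop_outliers(counts)
print(99 in kept)  # False
```

Without this step, the single outlier would pull the mean action count from 10 up to about 210, badly misrepresenting the typical user.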
    35. SYSTEMATIC BIASES IN SOCIAL SYSTEMS
       • A small percent are "hyper-active" users – avid users, spammers, trolls, administrators – who can unduly bias averages; remove outliers
       • A substantial percent are consumers but not producers ("lurkers"), and there is often no signal for lurkers
         • Consult the literature and related work for estimates – in so.cl, about 75% are lurkers
         • Custom instrumentation, logging sign-ins; web analytics for clicks
    36. PLAYING WITH YOUR DATA
       • It is very important to spend time examining your data: descriptives, frequencies, correlations, graphs
       • Use a tool that easily generates graphs and correlations
       • Does it make sense? If not, really chase it down – it is often a bug or a misinterpretation of the data
    37. AGGREGATIONS
       • Aggregation: merging down for summarization
       • What is your level of analysis? Person, group, network; content types
       • If the person is the unit of analysis, aggregate measures to the person level (e.g. in SPSS: one line per person)
       • It is very important to have the appropriate unit of analysis, to avoid bias in statistics
    38. AGGREGATIONS [Screenshot: SPSS syntax]
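The SPSS aggregation step shown on the slide (one input row per action, one output row per person) can be mirrored in plain Python. The action log below is hypothetical.

```python
from collections import defaultdict

# One input row per action; aggregate to one row per person,
# similar to what SPSS AGGREGATE with BREAK=user produces.
actions = [
    {"user": "alice", "type": "post"},
    {"user": "alice", "type": "comment"},
    {"user": "bob",   "type": "post"},
]
per_person = defaultdict(lambda: defaultdict(int))
for a in actions:
    per_person[a["user"]][a["type"]] += 1

print(dict(per_person["alice"]))  # {'post': 1, 'comment': 1}
```

After this step each person contributes exactly one row to any statistic, so heavy users no longer dominate action-level averages.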
    39. DESCRIPTIVES OF ACTIVE SESSIONS
       • Active session = a period of (public) activity, with a 60-minute gap of no activity before or after
       • 91% of users had only one active session
       • Sessions were, on average, 34.6 hours apart
       • The first session averaged 1.6 minutes
    40. DESCRIPTIVES OF ACTIONS [Chart: actions in first session]
       • 8% created a post in their first session; of those, they averaged 1.5 posts each
    41. DESCRIPTIVES OF COMING BACK
       • 9.1% came back for another active session (~25% including inactive sessions)
       • On average, 35 hours later
    42. IN THE FIRST SESSION
       • How often is a user the target of social behavior?
       • 23% received some response up to their 2nd session – only 3% if they did not create a post, 37% if they did
       • [Charts: responses *during* the first session; responses *in between* the 1st and 2nd sessions]
    43. STEP 3. HIGH LEVEL ANALYSIS
    44. PRELIMINARY CORRELATIONS
       • Always ask: does this pattern make sense?
    45. PREDICTORS OF COMING BACK
       • Social responses inspire people to return to the site, especially if they occur during the first session [Chart: N = 2273, N = 179, N = 1942, N = 510]
       • Social responses to a user: following, commenting on a post, liking a post, liking a comment, riffing
    46. WHICH RESPONSE MATTERS
       • Logistic regression, any response predicts coming back:
         • Created post first session: B = .71, S.E. = .20, Sig. = .000
         • Response 1 (during first session): B = 1.12, S.E. = .21, Sig. = .000
         • Response 2 (after first session): B = .60, S.E. = .17, Sig. = .000
       • Logistic regression, which response predicts coming back:
         • Created post first session: B = .95, Sig. = .000
         • Followed: B = .92, Sig. = .003
         • Commented on: B = .38, ns
         • Post liked: B = .87, Sig. = .02
         • Comment liked: B = -.09, ns
         • Messaged: B = -.09, ns
         • Riffed: B = .00, ns
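The B values on this slide are log-odds coefficients; exponentiating them gives odds ratios, which are easier to read. A small sketch using the slide's first regression:

```python
import math

# Coefficients from the slide's "any response" logistic regression.
coefs = {
    "created_post_first_session": 0.71,
    "response_during_first_session": 1.12,
    "response_after_first_session": 0.60,
}
for name, b in coefs.items():
    print(f"{name}: odds ratio = {math.exp(b):.2f}")
```

For example, exp(1.12) ≈ 3.06: receiving a social response during the first session roughly triples the odds of coming back, versus about 1.8x for a response that arrives only after the first session.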
    47. IDENTIFYING SUBGROUPS
       • Factor analysis of associated behaviors (principal components, varimax rotation, meaning the factors are forced to be orthogonal) reveals three types of usage: creating, socializing, and browsing
       • Component matrix: Creators (32% of variance), Socialites (12%), Browsers (9%); behaviors loading most strongly include creating posts, adding items to posts, and searching (creating); commenting, liking posts, liking comments, and messaging (socializing); and inviting, following, and viewing people (browsing)
       • The three factors about equally predict whether a user comes back (regression coefficients: creating Beta = .14, t = 5.28, Sig. = .000; socializing Beta = .07, t = 2.61, Sig. = .000; browsing Beta = .19, t = 7.20, Sig. = .000)
       • Browsing is a stronger predictor of overall activity level (creating Beta = 0.20, t = 7.89, Sig. = 0.00; socializing Beta = 0.17, t = 6.58, Sig. = 0.00; browsing Beta = 0.29, t = 9.07, Sig. = 0.00)
    48. NETWORK ANALYSIS
    49. CASE SCENARIO 2: ILLUSTRATING NETWORK ANALYSIS
       • Real-world problem: help people find and learn from others who share their interests online
       • Argument we want to make: people do not just care about content around their interests; they want to develop friendships with others who share their interests
       • Hypothesis: people will interact with others more the more common tags they have
       • Design implication: recommendations based on common, overlapping tags
    50. PROCESSING NETWORK DATA
       • Common format: an edge list of pairs, e.g. EntityA–EntityB, EntityB–EntityF, EntityB–EntityC, EntityD–EntityG
       • Units of analysis, each with its own measures: edges; nodes/vertices; clusters and networks
    51. OPERATIONALIZING CONNECTION
       • How would you measure… similar interests? Friendship? Information flow?
       • Is the connection asymmetrical?
       • Often some form of co-occurrence http://www.touchgraph.com/assets/navigator/help2/module_3_3.html
    52. NORMALIZATION
       • Normalization: adjusting values measured on different scales to a notionally common scale
       • Allows the comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences
       • Example: Mary has 400 friends, Jim has 200 friends, and Bob has 50 friends. Mary and Jim have 100 overlapping friends; Mary and Bob have 50 overlapping friends. How similar are they? Who is more similar to Mary?
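The Mary/Jim/Bob example shows why normalization matters. One common choice (among several, e.g. Jaccard or cosine similarity) is the overlap coefficient, which divides the shared count by the smaller friend list:

```python
def overlap_coefficient(a_size, b_size, shared):
    """Normalize a shared-friend count by the smaller friend list."""
    return shared / min(a_size, b_size)

# The slide's example: Mary has 400 friends, Jim 200, Bob 50.
mary_jim = overlap_coefficient(400, 200, 100)  # 0.5
mary_bob = overlap_coefficient(400, 50, 50)    # 1.0
print(mary_jim, mary_bob)
```

Raw counts say Jim (100 shared friends) is closer to Mary than Bob (50 shared); normalization reverses the answer, because every one of Bob's friends is also Mary's friend.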
    53. CASE STUDY
       • Real-world problem: help people find people like them online
       • Argument we want to make: the interests you share and tag online are a good indicator of what you are like
       • Hypothesis: people are more interested in receiving recommendations of whom to befriend based on overlapping tags than recommendations of random others in the system
    54. CONNECTION VIA OVERLAPPING TAGS
    55. NETWORK ANALYSIS (NODEXL)
       • Playing with the data, we learned:
         • All tagging is not a good indicator of what you are like – the tags on your posts are, whether or not you added them yourself
         • The most common tags are not very meaningful; unique overlapping tags are – hence the importance of normalization
    56. CONTENT ANALYSIS
    57. Douglas Wray - http://instagr.am/p/nm695/ @ThreeShipsMedia
    58. Outline
       • What's in social media? (donuts)
       • Extracting relationships and their context
       • Using context with higher-level analyses
    59. Do people really talk about donuts?
       • 1 week of tweets mentioning "donut" or "doughnuts" (week of Feb 6–12, 2012); matched ~180k messages
       • Trained an entity tagger for foods and for restaurants (no disambiguation or canonicalization)
       • Let's see what we find…
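The first step of a study like this is just keyword filtering, before any entity tagging. A minimal sketch of that pre-filter (the actual study used a trained entity tagger on top of the matched messages; the tweets below are invented):

```python
import re

# Match messages mentioning donuts, singular or plural, either spelling.
donut_re = re.compile(r"\b(donuts?|doughnuts?)\b", re.IGNORECASE)

tweets = [
    "Grabbed a maple doughnut with my coffee",
    "Stuck in traffic again",
    "Donut run, anyone?",
]
matched = [t for t in tweets if donut_re.search(t)]
print(len(matched))  # 2
```

The word boundaries (`\b`) keep the filter from firing on unrelated substrings, and the case-insensitive flag handles tweet capitalization.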
    60. Where do people get donuts?
    61. What do people drink with donuts?
    62. What kind of donuts do people eat?
    63. Beyond donuts…
       • Drugs, diseases, and contagions (Paul and Dredze 2011; Sadilek, Kautz and Silenzio 2012)
       • Crises, disasters, and wars (Starbird et al. 2010; Al-Ani, Mark & Semaan 2010; Monroy-Hernandez et al. 2012)
       • Public sentiment: political and election indices, market insights
       • Everyday life
    64. Relationships in Context
    65. Stage 1: Feature extraction
       • "I had fun hiking Tiger Mountain last weekend" – Alice said on Monday, at 10am
       • Extracted features: Location = Tiger Mountain; Mood = Happy; Activity = Hiking; Name = Alice; Gender = Female; Post Time = Mon 10am; Activity Time = {Sat–Sun}
    66. Stage 2(A): Build a hyper-graph representation
       • The same post becomes a hyper-edge connecting its extracted features (Location: Tiger Mountain; Gender: Female; Mood: Happy; Activity: Hiking; Name: Alice; Post Time: Mon 10am; Activity Time: {Sat–Sun})
    67. [Diagram: a second post by Bob (Gender: Male, Post Time: Fri 3pm) shares the Location: Tiger Mountain node with Alice's post]
    68. Stage 2(B): Projection
       • Reduce the graph to key domains
       • Statistical distributions of the other domains provide key context (e.g. the Location: Tiger Mountain – Activity: Hiking edge)
    69. [Diagram: the projected Tiger Mountain–Hiking edge, with gender distribution as context]
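Stages 1–2 above can be sketched as: extracted records are projected onto one key domain pair (location, activity), and the remaining domains become a statistical context attached to that edge. The records below are hypothetical.

```python
from collections import Counter, defaultdict

# Stage 1 output: one extracted feature record per post (hypothetical data).
records = [
    {"location": "Tiger Mountain", "activity": "Hiking",
     "gender": "Female", "mood": "Happy"},
    {"location": "Tiger Mountain", "activity": "Hiking",
     "gender": "Male", "mood": "Tired"},
    {"location": "Tiger Mountain", "activity": "Hiking",
     "gender": "Female", "mood": "Happy"},
]

# Stage 2(B): project onto (location, activity); other domains
# accumulate as a distribution on that edge.
context = defaultdict(Counter)
for r in records:
    edge = (r["location"], r["activity"])
    context[edge]["gender:" + r["gender"]] += 1
    context[edge]["mood:" + r["mood"]] += 1

edge = ("Tiger Mountain", "Hiking")
print(context[edge]["gender:Female"])  # 2
```

The projected edge now carries its own gender and mood distributions, which is exactly the "context" used by the higher-level analyses on the next slides.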
    70. Demo: example relationships & contexts from several domains
    71. Using context with high-level analyses
       • Current approaches: clustering; neighborhood discovery; network centrality
       • Context of discussion provides…
    72. Demo: example contexts for pseudocliques and network centrality
    73. CONCLUSIONS
       • Define research questions early to help focus the analysis
       • Many special considerations with social media data:
         • Operationalizing social constructs
         • Attention to lookiloos, hyper-actives, and lurkers, who bias outcomes
         • Different types of users = different behaviors
         • Different contexts meaningfully impact conversation
       • Processing data = simplification: getting meaningful measures summarized at the appropriate level of analysis
       • Format your data and plug it into an appropriate tool that enables you to play with your data a *lot* – important for debugging and finding patterns
       • Great tools are available for leveraging social media to describe and predict behaviors
    74. CONTACT INFO & QUESTIONS
       • Shelly Farnham, Researcher (@shellyshelly; shellyfa@microsoft.com)
       • Emre Kiciman, Researcher (emrek@microsoft.com)
