
Analyzing Social Media with Digital Methods. Possibilities, Requirements, and Limitations



Lecture given on October 22 2015 at the Universidade Nova de Lisboa.

Published in: Education


  1. 1. Analyzing Social Media with Digital Methods Possibilities, Requirements, and Limitations Bernhard Rieder Universiteit van Amsterdam Mediastudies Department
  2. 2. The starting point Social media are playing important roles in contemporary society, from the very personal to the very public. Many disciplines have begun to study social media, applying various methodologies (ethnography, questionnaires, etc.), but there is an explosion in data-driven research that relies on the computational analysis of data gleaned from social media platforms. The promise is (cheap and detailed) access to what people do, not what they say they do; to their behavior, exchange, ideas, and sentiments.
  3. 3. This presentation This talk introduces social media analysis using digital methods from a theoretically involved yet "practical" perspective. Instead of laying out an overarching "logic" of social media data analysis, I focus on the basic setup and the rich reservoir of analytical gestures that constitute the practice of data analysis. 1 / A (long) introduction 2 / Three examples covering Facebook, Twitter, and YouTube 3 / Some conclusions and recommendations
  4. 4. 1 / Introduction Social media services host an increasing number of relevant phenomena, including everyday practices, political presentation and debate, social and political activism, disaster communication, etc. A number of preliminary remarks: ☉ The phenomena one is interested in may not happen or resonate on social media; many things happen elsewhere. ☉ Even if one's research focus is on social media, one may not get the data. ☉ One requires at least some technical competence and the willingness to confront and learn about a number of technical matters. ☉ Every social media "platform" (Gillespie 2013) is different and requires a different approach; cf. "medium-specificity".
  5. 5. 1 / Introduction Hypothetico-deductive approaches are certainly possible, but this presentation espouses inductive "exploratory data analysis" (Tukey 1962) that emphasizes iteration, methodological flexibility, adjustment of questions, and "grounded theory" (Glaser & Strauss 1965). "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise. Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate." (Tukey 1962)
  6. 6. 1 / Introduction How does social media analysis with digital methods work? Four layers of technical mediation that one might want to think about: (1) Social media platform, e.g. Twitter, Facebook: users communicate, interact, express, publish, etc. through "grammars of action" (forms and functions) rendered in software. (2) API: technical interface to the data, defined in technical, legal, and logistical terms. (3) Extraction software, e.g. DMI-TCAT, Netvizz: makes calls to the API, creates "views" by combining data into specific sets or metrics, and produces outputs. Output type 1, widget: provides a visual or textual representation of a view, e.g. an interactive chart. Output type 2, file: data in a standard file format, e.g. CSV. (4) Analysis software, e.g. Excel, Gephi: allows analyzing files in various ways, e.g. statistics, graph theory.
  7. 7. 1 / Introduction - a / the platform Social media services channel communication, interaction, etc. through "grammars of action" (forms and functions) rendered in software; users appropriate these affordances. Every service is different. Every service changes over time, both in terms of technology and user practices. Homogeneous interfaces do not mean homogeneous practices. Platforms strive to capture large audiences and leave important margins to users.
  8. 8. Social media platforms are organized around instances of predefined types of entities (users, messages, hashtags, posts, etc.) and connections between them. They formalize and channel expression, exchange, and coordination and data fields are closely related to these formalizations.
  9. 9. Data fields mirror forms and functions of the platform.
  10. 10. Social media are different from the "open" Web because most data is formalized in fields and a "semantic data model". The more detailed the formalization, the more salient the data. Social media platforms are essentially large databases. 1 / Introduction - a / the platform
  11. 11. Very large numbers and variety in users, contents, purposes, arrangements, etc.
  12. 12. Social media are built around simple point-to-point principles; this allows for a variety of configurations to emerge over time. Every account is the same, but there are vast differences in scale. We need to begin with technical fieldwork and conceptualization of the platform.
  13. 13. 1 / Introduction - b / the APIs There are two possibilities to collect data automatically from social media platforms: scraping the user interface or collecting via specified application programming interfaces (APIs). APIs specify (technically, legally, logistically): ☉ What data can be retrieved (certain fields may be inaccessible or incomplete); ☉ How much data can be retrieved (all APIs have rate limits); ☉ The span of coverage (temporal limitations often apply); ☉ The perceptivity of coverage (privacy or personalization can skew access). For example, Facebook (currently) provides these variables for each post:

                                 comment   like   share
          count                  yes       yes    yes
          individual user list   yes       yes    no
          time-stamp             yes       no     no
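Rate limits and paging shape how any collector has to be written. The following is a minimal sketch in Python, not an actual platform client: `fetch_page` is a hypothetical stand-in for an API call that returns a list of items plus a cursor for the next page, and the call budget and window length are illustrative values only.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]

def collect(fetch_page: Callable[[Optional[str]], Page],
            max_calls_per_window: int = 15,
            window_seconds: float = 900.0,
            sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield items page by page, pausing when the call budget is used up.

    fetch_page is a hypothetical stand-in for an API call; the budget and
    window length are illustrative values, not those of a real platform.
    """
    cursor: Optional[str] = None
    calls_in_window = 0
    while True:
        if calls_in_window >= max_calls_per_window:
            sleep(window_seconds)   # wait for the rate-limit window to reset
            calls_in_window = 0
        items, cursor = fetch_page(cursor)
        calls_in_window += 1
        yield from items
        if cursor is None:          # no further pages
            break


# Toy data source standing in for an API: three pages of two posts each.
_PAGES = {None: ([{"id": 1}, {"id": 2}], "p2"),
          "p2": ([{"id": 3}, {"id": 4}], "p3"),
          "p3": ([{"id": 5}, {"id": 6}], None)}

pauses = []  # records requested pauses instead of actually sleeping
posts = list(collect(lambda c: _PAGES[c], max_calls_per_window=2,
                     window_seconds=1.0, sleep=pauses.append))
```

Passing the `sleep` function as a parameter is a design choice made here so the pause logic can be exercised without waiting; a real collector would also need to handle errors and whatever limits the platform documents.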
  14. 14. Social media users produce detailed data traces; data pools in social media are centralized and retrievable. Structure of APIs is closely related to given formalizations. In order to select, process, and interpret data we need to understand the platform: entities, relations, modes of aggregation, metrics, etc. Every platform is different and we thus need medium-specific data analysis.
  15. 15. 1 / Introduction - c / the extraction software Extraction software refers to the programs that connect to the APIs, retrieve data, and produce specific outputs. These programs can range from custom-written scripts to one-click visualization widgets. They work with API data, but add their own "epistemological twist", i.e. they produce particular views on the data. Sampling is often difficult; therefore n = all is the norm. Extraction software can be very simple and completely free or have steep technical, logistical, and financial requirements.
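To make the idea of a "view" concrete, here is a small Python sketch of what an extraction script might do between API and file output. The records, field names, and the `engagement` metric are invented for illustration and do not reproduce the output of Netvizz, DMI-TCAT, or any real API.

```python
import csv
import io

# Invented records, loosely shaped like what an API might return for posts.
records = [
    {"id": "p1", "type": "photo", "likes": {"count": 120}, "comments": {"count": 14}},
    {"id": "p2", "type": "status", "likes": {"count": 8}, "comments": {"count": 0}},
    {"id": "p3", "type": "link", "likes": {"count": 45}},  # comments field missing
]

def to_row(record: dict) -> dict:
    """Flatten one nested record into a flat row; the choice of columns is the 'view'."""
    likes = record.get("likes", {}).get("count", 0)
    comments = record.get("comments", {}).get("count", 0)
    return {
        "id": record["id"],
        "type": record["type"],
        "likes": likes,
        "comments": comments,
        "engagement": likes + comments,  # a derived metric added by the tool
    }

rows = [to_row(r) for r in records]

# Write the view as CSV text, ready for a spreadsheet or statistics package.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "type", "likes", "comments", "engagement"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Even this tiny script makes decisions that affect later analysis: a missing comments field is silently counted as zero, and likes and comments are summed as if they were equivalent. Such decisions are one way the "epistemological twist" enters.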
  16. 16. Example of a widget: Hashtagify
  17. 17. Example of a commercial service: Topsy
  18. 18. Example of an open source analytics suite: DMI-TCAT
  19. 19. Example of an open source analytics suite: DMI-TCAT
  20. 20. There are many different tools out there, with different conceptual underpinnings, ease of use, depth, etc. Data analysis (statistics): Excel, SPSS, Tableau, Wizard, Mondrian, … Data analysis (graph): Gephi, NodeXL, Pajek, … Data analysis (other): Rapidminer, SentiStrength, Wordij, … Data analysis (custom): R, Python (NLTK, NumPy & SciPy), … This presentation relies mostly on R (R Core Team 2014) and Gephi (Bastian, Heymann, Jacomy 2009). 1 / Introduction - d / the analysis software
  21. 21. 1 / Introduction - d / the analysis software Analysis software provides analytical gestures to apply to the data; it may be integrated into the extraction software or not. We investigate the structure of data by creating "views" of the data. Analytical gestures produce orderings, lists, tables, charts, coefficients, etc. that say something about the data and thus the phenomenon. Flusser (1991) describes gestures as having convention and structure, but as different from reflexes because they translate a moment of freedom. The notion of gesture indicates that data does not speak for itself; we approach it with particular epistemic techniques (methods) related to a sense of purpose, a "will to know" (Foucault 1976).
  22. 22. Analytical gestures develop from the tension between a "research purpose" (question, exploration, etc.) and the available data: The technical dimension of data (via platform, API, extraction software): ☉ Available units, variables, etc. ☉ Temporal coverage, completeness, perceptivity, etc. ☉ Technical formats, available "views", etc. The semantic dimension of data (aspects of practice): ☉ Demographic (age, sex, income, etc.) ☉ Post-demographic (tastes, preferences, etc.) ☉ Behavioral (trajectories, interaction, etc.) ☉ Expressive (messages, comments, etc.) ☉ Technical (informing on the platform's functioning) 1 / Introduction - d / the analysis software
  23. 23. Statistics Observed: objects and properties ("cases") Data representation: the table Visual representation: quantity charts Inferred: relations between properties Grouping: class (similar properties) Graph theory Observed: objects and relations Data representation: the adjacency matrix Visual representation: network diagrams Inferred: structure of relations between objects Grouping: clique (dense relations) 1 / Introduction - d / the analysis software
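The contrast between the two data representations can be sketched in a few lines of Python. The accounts, properties, and mentions below are invented example data.

```python
# Invented example data: three accounts and who mentions whom.
users = ["ana", "ben", "cleo"]

# Statistics: objects and properties ("cases"), one row per object.
table = [
    {"user": "ana", "posts": 12, "followers": 300},
    {"user": "ben", "posts": 3, "followers": 50},
    {"user": "cleo", "posts": 40, "followers": 1200},
]
mean_posts = sum(row["posts"] for row in table) / len(table)

# Graph theory: objects and relations, stored as an adjacency matrix.
mentions = [("ana", "ben"), ("ana", "cleo"), ("ben", "cleo"), ("cleo", "ana")]
index = {name: i for i, name in enumerate(users)}
adjacency = [[0] * len(users) for _ in users]
for source, target in mentions:
    adjacency[index[source]][index[target]] = 1  # directed edge source -> target

# Row sums give out-degree, column sums give in-degree.
out_degree = {u: sum(adjacency[index[u]]) for u in users}
in_degree = {u: sum(row[index[u]] for row in adjacency) for u in users}
```

The table answers questions about properties of individual objects (how much does each account post?), while the matrix answers questions about position in a structure of relations (who is mentioned by whom?).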
  24. 24. Quetelet 1827, Galton 1885, Pearson 1901 Regression, PCA, etc. are potentially useful. 1 / Introduction - d / the analysis software
  25. 25. Entities seem straightforward because data is well structured, but variations in scale and practice require being careful. Descriptive statistics for social media often profit from attention to the form of a distribution; visualization, multi-point summaries, and metrics like kurtosis or skewness are very useful. 1 / Introduction - d / the analysis software
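As an illustration of attending to the form of a distribution, the following Python sketch computes a five-number summary, skewness, and excess kurtosis using only the standard library. The comment counts are invented, and the population formulas used here are one common convention among several.

```python
import statistics

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def skewness(values):
    """Population skewness: third standardized moment."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0

# Invented comment counts per post: many quiet posts, one very large outlier.
comment_counts = [0, 1, 1, 2, 2, 3, 4, 5, 8, 400]

summary = five_number_summary(comment_counts)
mean_count = statistics.fmean(comment_counts)
median_count = statistics.median(comment_counts)
```

In this toy example the mean (42.6) is far above the median (2.5) because of a single post, which is the kind of situation where the mean alone would be misleading and a multi-point summary or a skewness value flags the problem.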
  26. 26. 1 / Introduction - d / the analysis software Moreno 1934, Forsythe and Katz 1946 Graph theory, "a mathematical model for any system involving a binary relation" (Harary 1969)
  27. 27. Three different force-based layouts of my FB profile OpenOrd, ForceAtlas, Fruchterman-Reingold
  28. 28. Non force-based layouts Circle diagram, parallel bubble lines, arc diagram
  29. 29. Nine measures of centrality (Freeman 1979) Network statistics (e.g. degrees, distances, density, etc.) can help describe and compare networks. Graph theory also provides many mathematical tools to derive metrics from the structure of a network (e.g. "centrality", "influence", "authority", etc.), to identify groupings, etc.
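Two of the simplest such measures, density and degree centrality, can be sketched in plain Python as follows. The friendship network is invented, and degree is only one of the centrality measures discussed in the literature.

```python
from collections import defaultdict

# Invented undirected friendship network.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]

neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

n = len(neighbors)
# Density: share of possible undirected edges that actually exist.
density = len(edges) / (n * (n - 1) / 2)
# Degree centrality: degree divided by the maximum possible degree (n - 1).
degree_centrality = {node: len(adj) / (n - 1) for node, adj in neighbors.items()}
most_central = max(degree_centrality, key=degree_centrality.get)
```

Tools such as Gephi compute these and many more metrics directly; writing them out once mainly helps to see what the numbers do and do not express.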
  30. 30. "Facebook Likes can be used to automatically and accurately predict a range of highly sensitive personal attributes including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender." (Kosinski, Stillwell, Graepel 2013) There are many new(ish) techniques coming from computer science for automatic classification, prediction, sentiment analysis, etc. 1 / Introduction - d / the analysis software
  31. 31. 1 / Introduction - conclusion Four layers of technical mediation to take into account: the platform itself, the API, the extraction software, the analytical techniques. To do productive work, attention to these four layers needs to be combined with theoretical resources and case knowledge. Bringing this together requires iteration and flexibility; it's “detective work – numerical detective work – or counting detective work – or graphical detective work” (Tukey, 1977).
  32. 32. 2 / Examples - a / Facebook Facebook is the largest social media platform with 1.5B monthly active users. It incorporates networked communication (friend-to-friend), group communication (Facebook Groups), and "mass" communication (Facebook Pages). A lot of analytical possibilities disappeared in April 2015 due to a comprehensive push for more privacy; open FB Groups and FB Pages are now the main entryways. Extraction tool used: Netvizz (Rieder 2013) Main example: Kullena Khaled Said Page (Rieder et al. 2015)
  33. 33. FB Pages allow for retrieval of historical data without time limit. 14K posts, 1.9M active users, 6.8M comments (99.9% Arabic), 32M likes Kullena Khaled Said was created in June 2010 by Wael Ghonim after Khaled Said was beaten to death by Egyptian police.
  34. 34. Variables available for each post:

                                 comment   like   share
          count                  yes       yes    yes
          individual user list   yes       yes    no
          time-stamp             yes       no     no

          There is a lot of material for analysis, but these numbers need extensive data critique.
  35. 35. Data quality is high but the platform is complex and changing over time. Is the linked content part of the data? These elements can drown in a large data set and skew it. The quantitative is full of qualitative considerations.
  36. 36. [Figure] Kullena Khaled Said, June 2010 – July 2013, posts per comment (timescatter). x-axis: date (2010-06-10 to 2013-07-03); y-axis: comments_count_fb (0 to 50,000); legend: type (link, music, photo, question, status, video).
  37. 37. [Figure] Kullena Khaled Said, June 2010 – July 2013, posts per comment (timescatter), y-scale log10. x-axis: date (2010-06-10 to 2013-07-03); y-axis: comments_count_fb; legend: type (link, music, photo, question, status, video).
  38. 38. [Figure] Kullena Khaled Said, June 2010 – July 2013, page posts (n=14,072) by type, per month. x-axis: date (2010-06 to 2013-07); y-axis: count (0 to 1,000); legend: type (link, music, photo, question, status, video).
  39. 39. Kullena Khaled Said, June 2010 – July 2013 Overview statistics
  40. 40. Kullena Khaled Said, June 2010 – July 2013 Comment speed
  41. 41. Kullena Khaled Said, June 2010 – July 2013 Comment length in characters
  42. 42. Kullena Khaled Said, June 2010 – July 2013 Rank-size distribution of ranked users (n = 1.9M) and likes/comments
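A rank-size distribution of this kind can be derived from raw activity data in a few lines. The sketch below uses an invented list of comment authors rather than the page's data.

```python
from collections import Counter

# Invented activity log: one entry per comment, naming the commenting user.
comment_authors = ["u1", "u2", "u1", "u3", "u1", "u2", "u4", "u1", "u5", "u1"]

counts = Counter(comment_authors)
# Rank users from most to least active; rank 1 is the most active user.
ranked = sorted(counts.values(), reverse=True)
rank_size = list(enumerate(ranked, start=1))

# Share of all comments written by the single most active user.
top_share = ranked[0] / sum(ranked)
```

Plotting `rank_size` on logarithmic axes is a common way to inspect how unevenly activity is spread across users.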
  43. 43. "Distant reading" 1: Tag cloud tool for comments on a post
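The counting step behind a tag cloud can be sketched as follows in Python. The comments and the stop-word list are invented placeholders, and the simple `\w+` tokenization is an assumption that would need more care for real multilingual data such as Arabic comments.

```python
import re
from collections import Counter

# Invented comments; the stop-word list is a tiny illustrative placeholder.
comments = [
    "Stop the corruption",
    "Corruption and torture must end",
    "End torture now",
]
stop_words = {"the", "and", "must", "now"}

words = []
for comment in comments:
    # Lowercase, split into word tokens, and drop stop words.
    for word in re.findall(r"\w+", comment.lower()):
        if word not in stop_words:
            words.append(word)

# Term frequencies would drive the font size of each word in the cloud.
term_counts = Counter(words)
```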
  44. 44. Distant reading 2: The comment search tool allows for exploration of comment contents (search terms shown: corruption, torture).
  45. 45. Manual translation: we used quantitative indicators to select posts and comments for qualitative analysis
  46. 46. Bipartite comment network June 2010 – July 2013 Nodes: posts (date: heat scale) / users (grey) Edges: commenting (invisible)
  47. 47. Bipartite comment network June 2010 – July 2013 Nodes: users (degree: heat scale) Edges: commenting (invisible)
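A bipartite comment network links two kinds of nodes, users and posts, through commenting. The Python sketch below builds one from invented (user, post) pairs and computes degrees; the projection onto users at the end is an optional, commonly used step added here for illustration and is not something the slides describe.

```python
from collections import defaultdict
from itertools import combinations

# Invented (user, post) pairs: each pair means "user commented on post".
comments = [("u1", "p1"), ("u2", "p1"), ("u3", "p1"),
            ("u1", "p2"), ("u2", "p2"), ("u4", "p3")]

users_by_post = defaultdict(set)
posts_by_user = defaultdict(set)
for user, post in comments:
    users_by_post[post].add(user)
    posts_by_user[user].add(post)

# Degree in the bipartite graph: distinct commenters per post, distinct posts per user.
post_degree = {p: len(us) for p, us in users_by_post.items()}
user_degree = {u: len(ps) for u, ps in posts_by_user.items()}

# One-mode projection: two users are linked with weight = number of shared posts.
co_commenting = defaultdict(int)
for commenters in users_by_post.values():
    for a, b in combinations(sorted(commenters), 2):
        co_commenting[(a, b)] += 1
```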
  48. 48. SIOTW Page Network, from DMI project on right-wing extremism and anti-Islamism
  49. 49. FB like network, seed: SIOTW, depth: 2, size: in-degree, color: modularity
  50. 50. FB like network, seed: SIOTW, depth: 2, size: in-degree, color (heat): PageRank
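In-degree and PageRank both rank nodes by incoming links, but PageRank also weighs where those links come from. The following is a minimal power-iteration sketch in Python on an invented page-like network; the damping factor of 0.85 and the handling of nodes without outgoing links are common textbook choices, and Gephi's implementation may differ in detail.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power iteration on a dict {node: [nodes it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Invented page-like network: an edge A -> B means "page A likes page B".
likes = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(likes)
in_degree = {node: sum(node in targets for targets in likes.values()) for node in scores}
```

In this toy network A and B have the same in-degree, yet A scores higher because its single incoming link comes from the best-ranked node.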
  51. 51. FB like network, seed: SIOTW, depth: 2, size: in-degree, color: modularity
  52. 52. FB like network, seed: SIOTW, depth: 2, size: in-degree, color: modularity
  53. 53. 2 / Examples – a / Facebook For Kullena Khaled Said, we were able not only to confirm the importance of the page for the Egyptian revolution, but also to gain a much better understanding of the dynamics of "connective action" (Bennett & Segerberg) and what we called "connective leadership". For the SIOTW network of self-declared affiliations, we were able to nuance the complicated and skewed relationship between right-wing anti-Islamism and Israeli actors and institutions. While API-based research into private relations and interactions on Facebook has become practically impossible, there are many opportunities for investigating public (Pages) and semi-public (Groups) settings.
  54. 54. 2 / Examples – b / Twitter While Twitter has fewer users than Facebook (320M MAU), it is used a lot in the context of media debate, political conversation, and activism. Twitter has very few privacy limitations, but data needs to be captured in real time. To access the archive, one has to pay. But there is a 1% sample. Extraction tool used: DMI-TCAT (Borra & Rieder 2014) Main example: #gamergate
  55. 55. #gamergate project preliminary exploration: is it about "ethics in game journalism" or a neo-conservative hate movement?
  56. 56. There are counts everywhere, but anything here can be exploited for analysis. Because of temporal limitations, Twitter analysis means creating databases of collected tweets.
  57. 57. DMI-TCAT, analysis interface #gamergate in September 2015 DMI-TCAT allows tracking keywords, user accounts, and the 1% sample.
  58. 58. DMI-TCAT, analysis interface #gamergate in September 2015
  59. 59. Medium specificity: legal elements Medium specificity: technical and functional elements
  60. 60. DMI-TCAT & Gephi, #gamergate in September 2015 Top 5000 user network
  61. 61. DMI-TCAT & Gephi, #gamergate in September 2015 Top 5000 users mention stats: Mean: 89 Median: 8 p90: 124 / p95: 279 / p99: 1943
  62. 62. DMI-TCAT & Gephi, #gamergate in September 2015 Top 5000 user network: Avg. degree: 33 Avg. weighted degree: 67.3 Avg. path length: 2.97
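To show what network statistics of this kind measure, here is a Python sketch on an invented weighted mention network. The conventions are assumptions made for the example: the graph is treated as directed, degree is averaged as edges per node, and path length is averaged over reachable ordered pairs; other tools may use different conventions.

```python
from collections import defaultdict, deque

# Invented weighted mention network: (source, target, number of mentions).
mentions = [("a", "b", 3), ("b", "a", 1), ("a", "c", 2), ("c", "d", 1), ("d", "a", 5)]

nodes = sorted({n for s, t, _ in mentions for n in (s, t)})
out_edges = defaultdict(list)
for source, target, _ in mentions:
    out_edges[source].append(target)

# Convention assumed here: out-degree averaged over nodes, i.e. edges per node.
avg_degree = len(mentions) / len(nodes)
avg_weighted_degree = sum(w for _, _, w in mentions) / len(nodes)

def distances_from(start):
    """Breadth-first search: shortest directed path length to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in out_edges[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist

# Average over all ordered pairs (u, v), u != v, where v is reachable from u.
path_lengths = [d for n in nodes for target, d in distances_from(n).items() if target != n]
avg_path_length = sum(path_lengths) / len(path_lengths)
```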
  63. 63. DMI-TCAT & Gephi, #gamergate in September 2015 Co-hashtag analysis, size: frequency, color: degree
  64. 64. DMI-TCAT & Gephi, #gamergate in September 2015 Co-hashtag analysis, size: frequency, color: user diversity
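The logic of a co-hashtag analysis can be sketched in Python as follows. The tweets are invented, and the definitions used here (frequency as number of tweets, user diversity as number of distinct users, edge weight as number of tweets containing both hashtags) are assumptions for illustration; DMI-TCAT's own definitions may differ.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Invented tweets as (user, text) pairs.
tweets = [
    ("u1", "Long read on #ethics and #journalism"),
    ("u2", "#ethics panel tonight #games"),
    ("u1", "More on #ethics #games #journalism"),
    ("u3", "Weekend plans: #games"),
]

frequency = Counter()             # how many tweets use each hashtag (node size)
users_per_tag = defaultdict(set)  # distinct users per hashtag (user diversity)
co_occurrence = Counter()         # edge weight: tweets containing both hashtags

for user, text in tweets:
    tags = sorted({t.lower() for t in re.findall(r"#(\w+)", text)})
    for tag in tags:
        frequency[tag] += 1
        users_per_tag[tag].add(user)
    for pair in combinations(tags, 2):
        co_occurrence[pair] += 1

user_diversity = {tag: len(users) for tag, users in users_per_tag.items()}
# Degree in the co-hashtag graph: number of distinct hashtags a tag co-occurs with.
degree = Counter(tag for pair in co_occurrence for tag in pair)
```

Comparing `frequency` with `user_diversity` is one simple check against bots or a few hyperactive accounts inflating a hashtag: in the toy data, "journalism" appears in two tweets but comes from a single user.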
  65. 65. DMI-TCAT (cascade interface), x: time, y: user account point: tweet, arc: retweet, bots in red
  66. 66. Associational profile around #feminism in #gamergate dataset
  67. 67. 2 / Examples – b / Twitter Twitter is a very open platform; the main problem is the requirement to anticipate or react quickly, since historical tweets are costly. Since tweets can easily be sent by bots and automators, we have to be very careful with metrics and always check from a number of different perspectives. For #gamergate, first findings show a very densely connected community organized around a group of highly active and visible accounts. Hashtag use (discounting bots) is dominated by outrage against perceived "minority favoritism", "social justice warriors", and anti-abuse measures; "ethics in journalism" is not prominent at all.
  68. 68. 2 / Examples - c / YouTube YouTube is maybe the most understudied (with digital methods) of the large social media platforms (1B+ users). YouTube is probably the most open social media platform, with very few limitations on the API level. YouTube Data Tools (YTDT), a new tool, is an attempt to facilitate data-driven research.
  69. 69. YouTube Data Tools Extracts Data from YouTube
  70. 70. YouTube Data Tools Channel Network uses data from the "Featured Channels", which allows for self-affiliation with other channels.
  71. 71. Gamergate channel network, via YouTube channel search, depth: 1; Size: subscriber count / Color: seed or not
  72. 72. Gamergate channel network, via YouTube channel search, depth: 1; Size: subscriber count / Color: in-degree
  73. 73. Gamergate channel network, via YouTube channel search, depth: 1; Size: subscriber count / Color: betweenness
  74. 74. 3 / Conclusions Social media analysis with digital methods relies on the "natively digital objects" (Rogers 2013) that platforms are built around; technical mediation intervenes in all stages of the research process. Despite the promise of easy access to well-structured data, there are considerable difficulties and limitations. Digital methods is not a one-click type of research, but requires considerable time and critical interrogation to produce robust results: which objects to take into account, how to create a sample / collection, how to analyze it, how to interpret, how to make findings.
  75. 75. 3 / Conclusions In order to deal with big and complex datasets, we need exploratory approaches that combine micro/macro and qualitative/quantitative in various ways: ☉ Investigate the platform in detail to account for technical pitfalls. ☉ Qualify quantities. ☉ Gain a sense of practices to orient quantitative methods. ☉ Use quantitative indicators to decide on qualitative focus. ☉ Read content to understand outliers. ☉ Make explicit plausibility tests based on reading. ☉ Interpret the small in relation to the large and the other way round. Because n=all, these articulations have become much more feasible. Every analytical gesture shows different things; combination completes the picture. We need "flexibility of attack, willingness to iterate" (Tukey 1962).
  76. 76. 3 / Conclusions There is a lot of excitement about social media data analysis, but our techniques are often still experimental and far from standardized. We need interrogation and critiques of methodology that are developed from engagement and historical / conceptual investigation. We need analytical gestures that are more closely tied to concepts from the humanities and social sciences. Visualization and simple tools are very interesting, but require technical and conceptual literacy to deliver more than (deceptive) illustrations.
  77. 77. 3 / Conclusions Data analysis for social media requires (in my view): ☉ Robust understanding of the social media platform; ☉ A sense of purpose; ☉ Conceptual understanding of methods and analytical gestures; ☉ Knowledge of software tools for data analysis; ☉ Considerable domain expertise. If you think that these approaches could be interesting for your research, I would recommend simply trying out some of the tools to get a first-hand impression.
  78. 78. Thank You! @RiederB All mentioned data extraction tools are freely available via and