Analyzing Social Media with Digital Methods. Possibilities, Requirements, and Limitations

Analyzing Social Media with Digital Methods
Possibilities, Requirements, and Limitations
Bernhard Rieder
Universiteit van Amsterdam
Mediastudies Department

The starting point
Social media are playing important roles in contemporary society, from the
very personal to the very public.
Many disciplines have begun to study social media, applying various
methodologies (ethnography, questionnaires, etc.), but there is an
explosion in data-driven research that relies on the computational analysis
of data gleaned from social media platforms.
The promise is (cheap and detailed) access to what people do, not what
they say they do; to their behavior, exchange, ideas, and sentiments.

This presentation
This talk introduces social media analysis using digital methods from a
theoretically involved yet "practical" perspective.
Instead of laying out an overarching "logic" of social media data analysis, I
focus on the basic setup and the rich reservoir of analytical gestures that
constitute the practice of data analysis.
1 / A (long) introduction
2 / Three examples covering Facebook, Twitter, and YouTube
3 / Some conclusions and recommendations

1 / Introduction
Social media services host an increasing number of relevant phenomena,
including everyday practices, political presentation and debate, social and
political activism, disaster communication, etc.
A number of preliminary remarks:
☉ The phenomena one is interested in may not happen or resonate on social media;
many things happen elsewhere.
☉ Even if one's research focus is on social media, one may not get the data.
☉ One requires a least some technical competence and the willingness to confront
and learn about a number of technical matters.
☉ Every social media "platform" (Gillespie 2013) is different and requires a different
approach; cf. "medium-specificity".

1 / Introduction
Hypothetico-deductive approaches are certainly possible, but this
presentation espouses inductive "exploratory data analysis" (Tukey 1962)
that emphasizes iteration, methodological flexibility, adjustment of
questions, and "grounded theory" (Glaser & Strauss 1965).
"Far better an approximate answer to the right question, which is often
vague, than an exact answer to the wrong question, which can always be
made precise. Data analysis must progress by approximate answers, at
best, since its knowledge of what the problem really is will at best be
approximate." (Tukey 1962)

1 / Introduction
How does social media analysis with digital methods work?
social media platform
e.g. Twitter, Facebook
users communicate,
interact, express, publish,
etc. through "grammars
of action" (forms and
functions) rendered in
software
API
technical interface to the
data, defined in technical,
legal, and logistical terms
extraction software
e.g. DMI-TCAT, Netvizz
makes calls to API,
creates "views" by
combing data into
specific sets or
metrics, produces
outputs
provides visual or textual
representation of view,
e.g. an interactive chart
data in standard file
format, e.g. CSV
allows analyzing
files in various
ways, e.g.
statistics, graph
theory
output type 1: widget
output type 2:
file
analysis software,
e.g. Excel, gephi
layers of technical mediation that one might want to think about
1
2
3 4

1 / Introduction - a / the platform
Social media services channel communication, interaction, etc. through
"grammars of action" (forms and functions) rendered in software; users
appropriate these affordances.
Every service is different. Every service changes over time, both in terms
of technology and user practices.
Homogeneous interfaces do not mean homogeneous practices. Platforms
strive to capture large audiences and leave important margins to users.

Social media platforms are organized
around instances of predefined
types of entities (users, messages,
hashtags, posts, etc.) and
connections between them.
They formalize and channel
expression, exchange, and
coordination and data fields
are closely related to these
formalizations.

Data fields mirror forms and
functions of the platform.

Social media are different from the "open" Web because most data is
formalized in fields and a "semantic data model".
The more detailed the formalization, the more salient the data.
Social media platforms are essentially large databases.
1 / Introduction - a / the platform

Very large numbers and variety in users,
contents, purposes, arrangements, etc.

Social media are built around simple
point-to-point principles; this allows
for a variety of configurations to
emerge over time.
Every account is the same, but there
are vast differences in scale. We need
to begin with technical fieldwork and
conceptualization of the platform.

1 / Introduction - b / the APIs
There are two possibilities to collect data automatically from social media
platforms: scraping the user interface or collecting via specified
application programming interfaces (APIs).
APIs specify (technically, legally, logistically):
☉ What data can be retrieved (certain fields may be inaccessible or incomplete);
☉ How much data can be retrieved (all APIs have rate limits);
☉ The span of coverage (temporal limitations apply often);
☉ The perceptivity of coverage (privacy or personalization can skew access);
For example, Facebook (currently) provides these variables for each post:
comment like share
count yes yes yes
individual user list yes yes no
time-stamp yes no no

Social media users produce
detailed data traces; data
pools in social media are
centralized and retrievable.
Structure of APIs is closely
related to given formalizations.
In order to select, process, and
interpret data we need to
understand the platform:
entities, relations, modes of
aggregation, metrics, etc.
Every platform is different and
we thus need medium-specific
data analysis.

1 / Introduction - c / the extraction software
Extraction software are the programs that connect to the APIs, retrieve
data, and produce specific outputs.
Can range from custom-written scripts to one-click visualization widgets.
These programs work with API data, but add their own "epistemological
twist", i.e. produce particular views on the data. Sampling is often
difficult, therefore n = all is the norm.
Extraction software can be very simple and completely free or have steep
technical, logistical, and financial requirements.

Example for a widget:
Hashtagify

Example for a commercial
service: Topsy

Example for a on open source
analytics suite: DMI-TCAT

There are many different tools out there, with different conceptual
underpinnings, ease of use, depth, etc.
Data analysis (statistics): Excel, SPSS, Tableau, Wizard, Mondrian, …
Data analysis (graph): Gephi, NodeXL, Pajek, …
Data analysis (other): Rapidminer, SentiStrength, Wordij, …
Data analysis (custom): R, Python (NLTK, NumPy & SciPy), …
This presentation relies mostly on R (R Core Team 2014) and Gephi
(Bastian, Heymann, Jacomy 2009).
1 / Introduction - d / the analysis software

Analysis software provide analytical gestures to apply to the data; may be
integrated into the extraction software or not.
We investigate the structure of data by creating "views" of the data.
Analytical gestures produce orderings, lists, tables, charts, coefficients etc.
that are saying something about the data and thus the phenomenon.
Flusser (1991) describes gestures as having convention and structure, but
as different from reflexes, because translating a moment of freedom.
The notion of gesture indicates that data does not speak for itself, we
approach it with particular epistemic techniques (methods) related to a
sense of purpose, a "will to know" (Foucault 1976).

Analytical gestures develop from the tension between a "research
purpose" (question, exploration, etc.) and the available data:
The technical dimension of data (via platform, API, extractions software):
☉ Available units, variables, etc.
☉ Temporal coverage, completeness, perceptivity, etc.
☉ Technical formats, available "views", etc.
The semantic dimension of data (aspects of practice):
☉ Demographic (age, sex, income, etc.)
☉ Post-demographic (tastes, preferences, etc.)
☉ Behavioral (trajectories, interaction, etc.)
☉ Expressive (messages, comments, etc.)
☉ Technical (informing on the platform's functioning)

Statistics
Observed: objects and properties ("cases")
Data representation: the table
Visual representation: quantity charts
Inferred: relations between properties
Grouping: class (similar properties)
Graph theory
Observed: objects and relations
Data representation: the adjacency matrix
Visual representation: network diagrams
Inferred: structure of relations between objects
Grouping: clique (dense relations)

Quetelet 1827, Galton 1885, Pearson 1901
Regression, PCA, etc. are potentially useful.

Entities seem straightforward because data is well structured, but
variations in scale and practice require being careful.
Descriptive statistics for social media often profit from attention to the form of a distribution;
visualization, multi-point summaries, and metrics like kurtosis or skewness are very useful.

Moreno 1934, Forsythe and Katz 1946
Graph theory, "a mathematical model for any
system involving a binary relation" (Harary 1969)

Three different force-based layouts of my FB profile
OpenOrd, ForceAtlas, Fruchterman-Reingold

Non force-based layouts
Circle diagram, parallel bubble lines, arc diagram

Nine measures of centrality (Freeman 1979)
Network statistics (e.g.
degrees, distances, density,
etc.) can help describing and
comparing networks.
Graph theory also provides
many mathematical tools to
derive metrics from the
structure of a network (e.g.
"centrality", "influence",
"authority", etc.), to identify
groupings, etc.

"Facebook Likes can be used to automatically and
accurately predict a range of highly sensitive
personal attributes including: sexual orientation,
ethnicity, religious and political views,
personality traits, intelligence, happiness, use of
addictive substances, parental separation, age,
and gender." (Kosinski, Stillwell, Graepel 2013)
There are many new(ish) techniques
coming from computer science for
automatic classification, prediction,
sentiment analysis, etc.

1 / Introduction - conclusion
Four layers of technical mediation to take into account: the platform itself,
the API, the extraction software, the analytical techniques.
To do productive work, attention to these four layers needs to be
combined with theoretical resources and case knowledge.
Bringing this together requires iteration and flexibility; it's “detective work
– numerical detective work – or counting detective work – or graphical
detective work” (Tukey, 1977).

2 / Examples - a / Facebook
Facebook is the largest social media platform with 1.5B monthly active
users. It incorporates networked communication (friend-to-friend), group
communication (Facebook Groups), and "mass" communication (Facebook
Pages).
A lot of analytical possibilities disappeared in April 2015 due to a
comprehensive push for more privacy; open FB Groups and FB Pages are
now the main entryways.
Extraction tool used: Netvizz (Rieder 2013)
Main example: Kullena Khaled Said Page (Rieder et al. 2015)

FB Pages allow for retrieval
of historical data without
time limit.
14K posts, 1.9M active
users, 6.8M comments
(99.9% Arabic), 32M likes
Kullena Khaled Said was
created in June 2010 by
Wael Ghonim after Khaled
Said was beaten to death
by Egyptian police.

comment like share
count yes yes yes
individual user list yes yes no
time-stamp yes no no
There is a lot of material for
analysis, but these numbers
need extensive data critique.

Data quality is high but the
platform is complex and
changing over time.
Is the linked content part
of the data?
These elements can drown
in a large data set and
skew it.
The quantitative is full of
qualitative considerations.

0
10000
20000
30000
40000
50000
2010−06−10
2011−01−01
2011−01−25
2012−01−01
2012−01−25
2013−01−01
2013−01−25
2013−07−03
date
comments_count_fb
type
link
music
photo
question
status
video
Kullena Khaled Said, June 2010 – July 2013 posts per
comment (timescatter)

Kullena Khaled Said, June 2010 – July 2013 posts per
comment (timescatter), y-scale log10
10
1000
2010−06−10
2011−01−01
2011−01−25
2012−01−01
2012−01−25
2013−01−01
2013−01−25
2013−07−03
date
comments_count_fb
type
link
music
photo
question
status
video

0
500
1000
2010−06
2010−07
2010−08
2010−09
2010−10
2010−11
2010−12
2011−01
2011−02
2011−03
2011−04
2011−05
2011−06
2011−07
2011−08
2011−09
2011−10
2011−11
2011−12
2012−01
2012−02
2012−03
2012−04
2012−05
2012−06
2012−07
2012−08
2012−09
2012−10
2012−11
2012−12
2013−01
2013−02
2013−03
2013−04
2013−05
2013−06
2013−07
date
count
type
link
music
photo
question
status
video
Kullena Khaled Said, June 2010 – July 2013 page
posts (n=14,072) by type, per month

Kullena Khaled Said, June 2010 – July 2013
Overview statistics

Comment speed

Comment length in characters

Rank-size distribution of ranked users (n = 1.9M) and likes/comments

"Distant reading" 1:
Tag cloud tool for
comments on a post

Distant reading 2:
The comment search tool
allows for exploration of
comment contents.
corruptiontorture

Manual translation: we used
quantitative indicators to
select posts and comments
for qualitative analysis

Bipartite comment network
June 2010 – July 2013
Nodes: posts (date: heat scale) / users (grey)
Edges: commenting (invisible)

Bipartite comment network
June 2010 – July 2013
Nodes: users (degree: heat scale)
Edges: commenting (invisible)

SIOTW Page Network, from DMI
project on right-wing extremism
and anti-Islamism

FB like network, seed: SIOW, depth: 2,
size: in-degree, color: modularity

FB like network, seed: SIOW, depth: 2,
size: in-degree, color (heat): PageRank

2 / Examples – a / Facebook
For Kullena Khaled Said, we were not only able to confirm the importance
of the page for the Egyptian revolution, but gain a much better
understanding of the dynamics of "connective action" (Bennett &
Segerberg) and what we called "connective leadership".
For the SIOTW network of self-declared affiliations, we were able to
nuance the complicated and skewed relationship between right-wing anti-
Islamism and Israeli actors and institutions.
While API-based research into private relations and interactions on
Facebook has become practically impossible, there are many opportunities
for investigating public (Pages) and semi-public (Groups) settings.

2 / Examples – b / Twitter
While Twitter has fewer users than Facebook (320M MAU), it is used a lot
in the context of media debate, political conversation, and activism.
Twitter has very few privacy limitations, but data needs to be captured in
real time. To access the archive, one has to pay. But there is a 1% sample.
Extraction tool used: DMI-TCAT (Borra & Rieder 2014)
Main example: #gamergate

#gamergate project preliminary exploration:
is it about "ethics in game journalism" or a
neo-conservative hate movement?

There are counts everywhere,
but anything here can be
exploited for analysis.
Because of temporal limitations,
Twitter analysis means creating
databases of collected tweets.

DMI-TCAT, analysis interface
#gamergate in September 2015
DMI-TCAT allows tracking keywords,
user accounts, and the 1% sample.

DMI-TCAT, analysis interface
#gamergate in September 2015

Medium specificity: legal elements
Medium specificity: technical and functional elements

DMI-TCAT & gephi, #gamergate
in September 2015
Top 5000 user network

in September 2015
Top 5000 users mention stats:
Mean: 89
Median: 8
p90: 124 / p95: 279 / p99: 1943

in September 2015
Top 5000 user network:
Avg. degree: 33
Avg. weighted degree: 67.3
Avg. path length: 2.97

in September 2015
Co-hashtag analysis, size:
frequency, color: degree

in September 2015
Co-hashtag analysis, size:
frequency, color: user diversity

DMI-TCAT (cascade interface), x: time, y: user account
point: tweet, arc: retweet, bots in red

Associational profile around
#feminism in #gamergate dataset

2 / Examples – b / Twitter
Twitter is a very open platform, the main problem is the requirement to
anticipate or react quickly since historical tweets are costly.
Since tweets can be easily sent by bots and automators, we have to be
very careful with metrics and always check from a number of different
perspectives.
For #gamergate, first findings show a very densely connected community
organized around a group of highly active and visible accounts.
Hashtag use (discounting bots) is dominated by outrage against perceived
"minority favoritism", "social justice warriors", and anti-abuse measures;
"ethics in journalism" is not prominent at all.

2 / Examples - c / YouTube
YouTube is maybe the most understudied (witch digital methods) of the
large social media platforms (1B+ users).
YouTube is probably the most open social media platform, with very few
limitations on the API level.
YouTube Data Tools (YTDT), a new tool, is an attempt to facilitate data-
driven research.

YouTube Data Tools
Extracts Data from YouTube

YouTube Data Tools
Channel Network uses data from the
"Featured Channels", which allows for
self-affiliation with other channels.

Gamergate channel network, via YouTube
channel search, depth: 1;
Size: subscriber count / Color: seed or not

Size: subscriber count / Color: in-degree

Size: subscriber count / Color: betweenness

3 / Conclusions
Social media analysis with digital methods relies on the "natively digital
objects" (Rogers 2013) that platforms are built around; technical
mediation intervenes in all stages of the research process.
Despite the promise of easy access to well-structured data, there are
considerable difficulties and limitations.
Digital methods is not a one-click type of research, but requires
considerable time and critical interrogation to produce robust results:
which objects to take into account, how to create a sample / collection,
how to analyze it, how to interpret, how to make findings.

3 / Conclusions
In order to deal with big and complex datasets, we need exploratory
approaches that combine micro/macro and qualitative/quantitative in
various ways:
☉ Investigate the platform in detail to account for technical pitfalls.
☉ Qualify quantities.
☉ Gain a sense of practices to orient quantitative methods.
☉ Use quantitative indicators to decide on qualitative focus.
☉ Read content to understand outliers.
☉ Make explicit plausibility tests based on reading.
☉ Interpret the small in relation to the large and the other way round.
Because n=all these articulations have become much more feasible.
Every analytical gesture shows different things, combination completes the
picture. We need "flexibility of attack, willingness to iterate" (Tukey 1962).

3 / Conclusions
There is a lot of excitement about social media data analysis, but our
techniques are often still experimental and far from standardized.
We need interrogation and critiques of methodology that are developed
from engagement and historical / conceptual investigation.
We need analytical gestures that are more closely tied to concepts from
the humanities and social sciences.
Visualization and simple tools are very interesting, but require technical
and conceptual literacy to deliver more than (deceptive) illustrations.

3 / Conclusions
Data analysis for social media requires (in my view):
☉ Robust understanding of the social media platform;
☉ A sense of purpose;
☉ Conceptual understanding of methods and analytical gestures;
☉ Knowledge of software tools for data analysis;
☉ Considerable domain expertise;
If you think that these approaches can be interesting for your research, I
would recommend to simply try out some of the tools to get a first-hand
impression.

Thank You!
rieder@uva.nl
@RiederB
http://thepoliticsofsystems.net
https://www.digitalmethods.net
All mentioned data extraction tools are freely available via
http://labs.polsys.net and https://tools.digitalmethods.net

Analyzing Social Media with Digital Methods. Possibilities, Requirements, and Limitations

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Analyzing Social Media with Digital Methods. Possibilities, Requirements, and Limitations

Similar to Analyzing Social Media with Digital Methods. Possibilities, Requirements, and Limitations (20)

More from Bernhard Rieder

More from Bernhard Rieder (12)

Recently uploaded

Recently uploaded (20)

Analyzing Social Media with Digital Methods. Possibilities, Requirements, and Limitations

Editor's Notes