Working with Social Media Data: Ethics & good practice around collecting, using and storing data

Working with Social Media Data:
Ethics & Good Practice around
Collecting, Using and Storing Data
Nicola Osborne
Digital Education Manager, EDINA
Nicola.osborne@ed.ac.uk
@suchprettyeyes

Introductions: my social media work
• Digital Education Manager at EDINA, University of Edinburgh.
• Work on EDINA’s educational technology, innovation, digital and data projects
for audiences across Scotland, UK and further afield.
• Co-I on: PTAS-funded Managing Your Digital Footprints research strand
(2014-2015); Ongoing (2015-) Managing Your Digital Footprint research team;
PTAS-funded “A Live Pulse”: Yik Yak for understanding teaching, learning and
assessment at Edinburgh project.
• Co-tutor on ongoing Digital Footprint MOOC (2017-)
• Previously EDINA Social Media Officer (2009-2015), providing expertise and
advice on social media to colleagues across UoE for over 8 years.
http://edina.ac.uk/

Introduction: you and your work
1. Who are you?
2. What social media related research are you
working on or hoping to work on?
3. What do you hope to get out of today’s
session?

Overview
• Introduction & Design Considerations
– Approach
– Data accuracy
• Ethical Considerations
– Recommended ethical guidance
– Terms & Conditions – and impact on Data
– Consent and trust
• Practical Considerations
– Existing data sets
– Available data tools
– APIs
– Options for analysis and visualisation
• Storing and handling Data
– Compliance with legal requirements
– Sources of support
• Recommended researchers, groups, and resources.
• Q&A/Discussion – but questions welcome throughout!

Where to start…
• What is your research question(s)?
• Are social media or social media communities the
subject, or core to the subject?
• Or, is it the space for recruitment or reaching an
audience?
• Or, is it just a convenient space for data collection?

The Elephant (Blue Bird) in the Room
Image ©Twitter.com 2012

Research Design Considerations
• Research approach to be taken
• Appropriate data types to support your research
– Streaming/live data OR
– Archived / capture of data over time with asynchronous analysis
• Ethical considerations
• Consent process of subjects and their network
• Etiquette considerations
• Platform(s) to be used
– Fit with target subjects
– Terms & Conditions
• Practical access limitations e.g.
– Do tools for data capture exist?
– Does an API exist?
– What are the API limitations?
– Costs of access
• Your (researcher) or your RAs’ expertise.
• Long term research vision – do you have rights to use
and reuse data in the ways you hope to?

Possible Methods &
Questions to Think About
• Computational (See also Batrinca and Treleaven 2015):
– Data access through APIs, screen scraping, established methods (e.g. DMI tools)?
– Text and data mining and/or Natural Language Processing (NLP)?
– Social network analysis and/or Actor Network Theory (ANT) analysis using nodes and edges in the network?
– Sentiment analysis based on text mining/NLP or based on presence/absence of emojis and/or visual content?
– Visual analysis and/or video or audio analysis for multimedia content?
• Quantitative (See also OII 2013a, b & c):
– Medium or large scale data?
– Automated or survey/volunteered data collection?
– Data cleansing process – how will you ensure that you have a good quality data set?
– What kind of statistical analysis do you want to undertake? Tools might include SPSS, NVivo, Gephi, Tableau, etc.
– Will you be comparing to existing data sets and/or undertaking trend analysis over time?
– What standard tools in your field – for digital or non digital data – can you use to collect or interpret your data?
• Qualitative:
– Manual collection?
– Ethnographic approaches and/or participant observation
– Focus groups or similar?
– Critical/reflexive reading and coding of texts/content
Batrinca, B. and Treleaven, P.C., 2015. Social Media Analytics: a survey of techniques, tools and platforms. In AI & Society, 30 (1). Pp. 89-116. https://doi.org/10.1007/s00146-014-0549-4
Oxford Internet Institute, 2013a. Quantitative Methods in Social Media Research: Big Data. In OII YouTube Channel, 15 March 2013. See: https://www.youtube.com/watch?v=6hmjj7_1sSY
Oxford Internet Institute, 2013b. Quantitative Methods in Social Media Research: Populations and Sampling. In OII YouTube Channel, 15 March 2013. See: https://www.youtube.com/watch?v=6hmjj7_1sSY
Oxford Internet Institute, 2013c. Space-Time as a Sampling Condition for New Media Research. In OII YouTube Channel, 15 March 2013. See: https://www.youtube.com/watch?v=HNxn0PqOc8k
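
As a minimal sketch of the "nodes and edges" idea in the social network analysis bullet above, the Python below builds a weighted edge list from @mentions in post text. The sample posts, the account names and the `mention_edges` helper are hypothetical illustrations, and the mention pattern is deliberately simplified (it would also match the domain part of an email address).

```python
import re
from collections import Counter

# Simplified pattern for @handles; real platforms have stricter username rules.
MENTION = re.compile(r"@(\w+)")

def mention_edges(posts):
    """Count directed (author -> mentioned account) edges across posts."""
    edges = Counter()
    for author, text in posts:
        for target in MENTION.findall(text):
            edges[(author.lower(), target.lower())] += 1
    return edges

# Hypothetical sample data: (author, post text) pairs.
posts = [
    ("alice", "Great talk by @bob and @carol today"),
    ("bob", "Thanks @alice!"),
    ("alice", "Agree with @bob on this"),
]

edges = mention_edges(posts)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

An edge list of this kind could then be loaded into a network tool such as Gephi for visualisation.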

Is Social Media Data Representative?
• Not all people use social media (and some of the least privileged groups in society are not
online at all).
• Most social media data collection methods favour English language data in mainstream
US/Global sites. It is unusual to see multilingual research or research that acknowledges use
of content including non-English text by primarily English speakers.
• Privacy settings and publicness tend to reflect status and privilege. Accessing at-risk,
vulnerable, heavily trolled, and/or niche interest groups is more difficult than obtaining public
posts from middle class white male social media users. BAME communities, women’s groups,
LGBTQ+ communities, etc. tend to make higher use of private groups, group moderation, and
protective measures that require more qualitative and overt consent-based approaches.
• Not all social media users are active. There is an “activity and agency bias” (Lutz and Hoffmann
2017) in much of the current research. Obtaining data on passive reads and engagement with
content is extremely difficult through quantitative methods. It may be easier with participant
observation.
Lutz, C. and Hoffmann, C. P. 2017. The dark side of online participation: exploring non-, passive and negative participation. In Information,
Communication & Society: AoIR Special Issue, 20 (6), pp. 876-897. http://dx.doi.org/10.1080/1369118X.2017.1293129

Question/Discussion
Which platform(s) are you intending to/are you
working with?
How did you select these social media spaces?

Ethical Considerations
• Visibility vs expectations of privacy:
– Being “in public” is not consent to being researched; a user’s imagined audience may be quite different.
(see AoIR guidance, Marwick and boyd 2011)
– Are you engaging with private or “public” figures – expectations over visibility will vary significantly.
• How possible is it to obtain informed consent for work undertaken with your chosen social
media platform? How can consent be withdrawn?
• How will your data be collected and used? (Attributed vs Pseudonyms vs Anonymous).
• What personal data is being used? Does it put anyone at risk?
• What is the risk of accidental exposure or re-identification? Text snippets, quotes and images
may all be easily searchable.
• Public – or previously public – data can change in sensitivities over time.
• How will you handle/remove/retain subsequently deleted content?
Marwick, A. and boyd, d., 2011. I tweet honestly, I tweet passionately: Twitter users, context
collapse, and the imagined audience. In New Media & Society, 13 (1), pp. 114-133. DOI:
10.1177/1461444810365313.
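
For the "Attributed vs Pseudonyms vs Anonymous" question above, one possible approach is sketched below: replacing account handles with stable pseudonyms using a keyed hash. The `pseudonymise` helper, the example handle and the key are all hypothetical. Note that this only addresses handles; quoted text and images may still be searchable, as the slide warns.

```python
import hashlib
import hmac

def pseudonymise(handle, secret_key):
    """Return a stable pseudonym for a handle using a keyed hash (HMAC-SHA256)."""
    normalised = handle.strip().lstrip("@").lower()
    digest = hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]

# Hypothetical key: in practice keep it secret and store it separately from the data.
key = b"replace-with-a-long-random-secret"

print(pseudonymise("@ExampleUser", key))
print(pseudonymise("exampleuser", key))  # same account, same pseudonym
```

Using a secret key rather than a plain hash means someone without the key cannot simply hash a list of known handles to reverse the pseudonyms.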

Recommended: AoIR Ethics Guidance
• AoIR Ethics Guidance (2012):
https://aoir.org/reports/ethics2.pdf
• AoIR Ethics Chart – a quick guide to
key issues:
https://aoir.org/aoir_ethics_graphic_2016/
• AoIR Ethics Guidance (2002):
https://aoir.org/reports/ethics.pdf
• Annette Markham (co-author of AoIR
guidance) on Impact Models for
ethical decision making in data
research and design:
https://annettemarkham.com/2017/07/impact-model-ethics/

Recommended: Social Media
Research: A Guide to Ethics
• Excellent concise research ethics
guidance from the ESRC-funded
“Social Media, Privacy and
Risk: Towards More Ethical Research
Methodologies” project at
University of Aberdeen.
• Includes pointers to further social
media ethics resources.
• Townsend, L. and Wallace, C. 2016.
Social Media Research: A Guide to
Ethics. Aberdeen: University of
Aberdeen/ESRC Social Media
Enhancement project. Available from:
http://www.dotrural.ac.uk/social-media-research-ethics/

“But the data is already public”
In 2008 researchers released profile data (The T3 Data Set) from Facebook accounts of students at a US University,
inadvertently making identifiable data public, as reported in Zimmer (2010).
In this case the researchers:
• Had employed RAs who were part of the Network being examined and had (various levels of) access to more
information than a non-logged-in user of Facebook/user beyond the Network.
• Had funding that mandated open publishing and sharing of results.
• Had the University’s but not individuals’ consent for data collection.
• Combined Facebook with university housing data in their data sets.
• Obscured the identity of the university where students were based, but described key characteristics.
• Attempted to make all data anonymous by removing identifying information (name, student id, etc.) but left
network and behavioural information intact.
• Asked other researchers using the data not to attempt to reidentify subjects.
• Stated that “hackers” and “extreme effort” would be the only way to “crack” the data.
The university was identified swiftly based purely on the codebook and other writings about the data, without
requiring direct access to the data. Once the university was identified, other specific identifying data (nationality, race,
home state, etc.), sometimes with only 1 individual in these groups, made re-identification of (some) students simple.
After public scrutiny and identification of the university, the data set was swiftly withdrawn by the researchers.
Zimmer, M. 2010. “But the data is already public”: on the ethics of research in Facebook. In Ethics and Information
Technology, 12 (4), December 2010, pp. 313-325. https://link.springer.com/article/10.1007%2Fs10676-010-9227-5
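
The re-identification risk in this case (groups containing only one individual) can be checked for before release. The sketch below counts how many records share each combination of attributes and flags the small groups; the records, field names, `small_groups` helper and threshold `k` are hypothetical and are not taken from the T3 data set.

```python
from collections import Counter

def small_groups(records, quasi_identifiers, k=5):
    """Return attribute combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical records with names removed but demographic attributes intact.
records = [
    {"nationality": "US", "home_state": "Ohio"},
    {"nationality": "US", "home_state": "Ohio"},
    {"nationality": "US", "home_state": "Ohio"},
    {"nationality": "Iceland", "home_state": "n/a"},
]

risky = small_groups(records, ["nationality", "home_state"], k=2)
print(risky)
```

Any combination that appears here identifies a very small group, so those records would need generalising, suppressing or further review before sharing.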

Terms & Conditions
• Before undertaking any social media research understand the T&Cs and
Developer T&Cs for the platform(s) you are looking at.
• Understand how your research aligns with the T&Cs, and any possible
issues of privacy, etiquette, or practical access.
• If your work is in conflict with T&Cs either re-design your research
(strongly recommended) or look carefully at risks and impacts.
• You should not ignore any T&Cs for technical reasons. If there is a valid
reason to ignore T&Cs for specific research reasons (such as research on
deleted tweets), be prepared to justify that to ethics boards and peer
reviewers. And understand that you may risk losing access to the platform
and your research data if you are found to be in breach of T&Cs.

Twitter Developer T&Cs of note (1)
Section VII (Other Important Terms), A: User Protection:
"Twitter Content, and information derived from Twitter Content, may not be
used by, or knowingly displayed, distributed, or otherwise made available
to:"…
"any entity for the purposes of conducting or providing surveillance, analyses
or research that isolates a group of individuals or any single individual for any
unlawful or discriminatory purpose or in a manner that would be inconsistent
with our users' reasonable expectations of privacy;"
https://developer.twitter.com/en/developer-terms/agreement-and-policy

Twitter Developer T&Cs of note (2)
Section VII (Other Important Terms), C: Respect Users' Control and Privacy:
"3. If Content is deleted, gains protected status, or is otherwise suspended,
withheld, modified, or removed from the Twitter Service (including
removal of location information), you will make all reasonable efforts to
delete or modify such Content (as applicable) as soon as reasonably
possible, and in any case within 24 hours after a request to do so by
Twitter or by a Twitter user with regard to their Content."
https://developer.twitter.com/en/developer-terms/agreement-and-policy
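
One way a local archive might honour the deletion clause above is sketched below. It assumes you have already obtained, by some permitted means such as re-requesting stored IDs through the platform's API, the list of IDs that are still available; the `purge_deleted` helper and the sample records are hypothetical.

```python
def purge_deleted(stored_posts, still_available_ids):
    """Keep only stored posts whose IDs the platform still reports as available."""
    available = set(still_available_ids)
    kept = [post for post in stored_posts if post["id_str"] in available]
    removed = [post["id_str"] for post in stored_posts if post["id_str"] not in available]
    return kept, removed

# Hypothetical local archive of captured posts.
stored = [
    {"id_str": "1", "text": "First post"},
    {"id_str": "2", "text": "Later deleted by its author"},
    {"id_str": "3", "text": "Third post"},
]

kept, removed = purge_deleted(stored, ["1", "3"])
print("Kept:", [post["id_str"] for post in kept])
print("Removed:", removed)
```

Logging only the removed IDs, not their content, gives an audit trail of compliance without retaining the deleted material.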

Facebook Statement of Rights &
Responsibilities
Section 5: Protecting Other People's Rights
"We respect other people's rights, and expect you to do the same.
1. You will not post content or take any action on Facebook that infringes or violates someone else's rights or
otherwise violates the law.
2. We can remove any content or information you post on Facebook if we believe that it violates this Statement or
our policies.
3. We provide you with tools to help you protect your intellectual property rights. To learn more, visit our How to
Report Claims of Intellectual Property Infringement page.
4. If we remove your content for infringing someone else's copyright, and you believe we removed it by mistake, we
will provide you with an opportunity to appeal.
5. If you repeatedly infringe other people's intellectual property rights, we will disable your account when
appropriate.
6. You will not use our copyrights or Trademarks or any confusingly similar marks, except as expressly permitted by
our Brand Usage Guidelines or with our prior written permission.
7. If you collect information from users, you will: obtain their consent, make it clear you (and not
Facebook) are the one collecting their information, and post a privacy policy explaining what
information you collect and how you will use it.
8. You will not post anyone's identification documents or sensitive financial information on Facebook.
9. You will not tag users or send email invitations to non-users without their consent. Facebook offers social
reporting tools to enable users to provide feedback about tagging."
https://www.facebook.com/terms.php

Trust in Social Networks
vs Trust in Research
Research Ethics – Randall Munroe/xkcd (https://xkcd.com/1390/) Licensed under CC-BY-NC 2.5
Trust in social media networks is mixed, with
users increasingly savvy about data use…
However…
• Social Media users can find observation by
academic researchers more disconcerting
than by the companies who own the
platforms.
• Research, depending on the topic, can feel
like a judgement on behaviours making
consent hugely important.
• The burden on researchers to be clear about
motives, funders, process, etc. is higher than
on commercial companies.
• There are parallels here to how individuals
feel about e.g. Tesco Clubcard or Credit Card
data capture vs. surveys and censuses.

Question/Discussion
What are the ethical concerns and
considerations for your current (or previous)
social media research?

Obtaining Consent
• Consent may be implicitly included for API data access in some terms and
conditions BUT, when did you last read the terms and conditions? What about
your research participants?
So:
• Obtain explicit consent wherever possible.
• Be transparent if you are engaging in research in a space – with a pinned post, link
to your participant information sheet, etc.
• Consent can be tricky in anonymous and less traditional social media spaces (see
e.g. Osborne 2017 for approaches used with Yik Yak).
• Apply particular caution to gaining consent for screen shots, attributed posts,
reproducing exact images or text of posts etc.
Osborne, N. 2017. Addressing ethics of research in anonymous online spaces. In “A Live Pulse”: Yik Yak for understanding
teaching, learning and assessment at Edinburgh [blog], 13th July 2017.
http://yikyakresearch.blogs.edina.ac.uk/2017/07/13/addressing-ethics-of-research-in-anonymous-online-spaces/

Some Common Ethics Pitfalls
• Researcher assumes public data can be used in any way desired, without
considering the subject(s)’ intent when originally sharing their profile/post etc.
• Researcher explores conveniently available “public” data without realising that
privacy settings may make more information available to them than is truly
“public”.
• Researcher is using “big” data under belief that individuals will not be identifiable
(as in the “But the data is already public” case).
• Research subject(s) has shared data on a public site but is not aware of their own
settings, or has not checked them lately, making implicit consent and the public
nature of the data problematic. Discovering that they have been included in
published research may be upsetting and problematic.
• Research Ethics Committees and/or Journal Editorial Boards are unaware or do not
properly consider that social media data includes real names, pseudonyms,
locations, highly disclosive data and do not ask the right questions around the
consent process, collection, aggregation, storage and retention of data.
• Researcher uses full text of a post as an “anonymous” example but this is then
Googled which identifies the original post/tweet/content and individual.

Data Considerations
• What kind of research approach are you taking?
• Who or what is the subject of your research – what is the right social media space to capture
appropriate data?
• What scale of data are you looking to collect/harvest? (If working with big data see boyd &
Crawford 2012)
• Will you be sampling or looking to collect all data over a specific time period?
• How sensitive is the topic?
• What level and type of consent can you obtain from participants?
• What kind of content?
– Profiles – for network analysis, image analysis, qualitative review of content through profile components/data?
– Posts – through API/data feed/harvesting or observation? Textual, visual, multimedia? Manual coding or text/data
mining?
– Comments/discussion – contents or threads of discussion?
– Metadata – tags, likes, engagements?
• Time bounds – how long do you expect to collect data for?
• What use will you make of the data after capture?
boyd, d. and Crawford, K., 2012. Critical questions for big data. In Information, Communication & Society special issue: A decade in
internet time: the dynamics of the internet and society, 15 (5). http://dx.doi.org/10.1080/1369118X.2012.678878

Sources of baseline data on usage,
access, trends, literacies etc.
• Oxford Internet Surveys: biennial data on UK public use and attitudes to the
internet, including social media: http://oxis.oii.ox.ac.uk/research/dataset-request/
• Ofcom research and data: Regular reporting on UK public use and attitudes to
media, including internet and social media:
https://www.ofcom.org.uk/research-and-data/search. Includes:
– Annual adult media use and attitudes, and children’s media literacy reporting:
https://www.ofcom.org.uk/research-and-data/media-literacy-research;
– Communications Market Report: annual overview of consumer use of communications of all types:
https://www.ofcom.org.uk/research-and-data/multi-sector-research/cmr
– Further regular and one-off data via the statistical release calendar:
https://www.ofcom.org.uk/research-and-data/data/statistics
• Pew Internet & American Life datasets: data on US public use, knowledge and
understanding of the web, digital literacy, social media, etc:
http://www.pewinternet.org/datasets/. For example:
– Social Media Update 2016: http://www.pewinternet.org/2016/11/11/social-media-update-2016/

Sources of Official Social Media
Usage Data, Trends, Financials, etc.
Best sources are quarterly earnings reports and presentations, typically including: monthly active
users, usage trends, earnings, monetization strategies, financials, future plans:
• Facebook & Instagram & WhatsApp: https://investor.fb.com/home/default.aspx
• Twitter: https://investor.twitterinc.com/results.cfm
• SnapChat: https://investor.snap.com/events-and-presentations/events
• YouTube/Google via Alphabet: https://abc.xyz/investor/
• Flickr:
– Currently owned by Oath, should be via Verizon once deal closes:
http://www.verizon.com/about/investors
– Historical up to 2017, via Yahoo captures in the Internet Archive:
https://web.archive.org/web/*/https://investor.yahoo.net/index.cfm
• LinkedIn:
– Current via Microsoft: https://www.microsoft.com/en-us/investor/
– Historical up to 2016: https://news.linkedin.com/topic/earnings
• Weibo: http://ir.weibo.com/phoenix.zhtml?c=253076&p=irol-irhome

Privately Held Social Media
• Crunchbase (https://www.crunchbase.com/) is a good source of
information on shareholders/owners, acquisitions, finances, etc.
• Alexa web rankings (owned by Amazon) give an overview of usage levels
and trends based on ranking relative to other sites in the US, and globally.
• Social Media sites’ “business” and “press” sites, official blogs and news
releases are best for user data.
• Some social media provide advertising APIs – which may be usable for
research depending on T&Cs and data content - but not developer or open
APIs, e.g. Snapchat:
https://www.snap.com/en-GB/news/post/third-party-applications-and-the-snapchat-api/
e.g:
– Pinterest:
• data on usage from Pinterest: https://business.pinterest.com/en
• Alexa data on usage: https://www.alexa.com/siteinfo/pinterest.com
• investor data:
https://www.crunchbase.com/organization/pinterest/investors/investors_list

Data Quality & Reliability
• Data sources and APIs can change regularly, and what is available may change over time (e.g.
Twitter moved from all to “Top” tweets some years ago for its API; Facebook have changed data
structures multiple times).
• Errors in automated data collection can be hard to spot until analysis is undertaken – sampling, trial
data collection, and review of code by colleagues can all be useful.
• Gaps in data may occur because there are genuine gaps in data creation/posting etc; because there
are technical issues with the social media service; because of an error in your code; or because you
are over your API rate limit for the minute/hour/day.
• Data may change over time – Facebook and Instagram allow posts to be edited so a request will
capture one moment in time not necessarily the original or final versions.
• Data may disappear over time. Notable example: the Twitter terms and conditions on deletions mean
that deleted tweets will not appear in a later API call.
– Research tools obeying the T&Cs will also update and remove deleted tweets.
– Research tools retaining deleted tweets are technically in breach of the T&Cs.
• Acquisitions, mergers, and shutdowns of social media sites can lead to changed terms and
conditions, changes to data availability and use, changes or removals of APIs and data access
routes, changes to user presence in a space, and changes to acceptable norms within a space
(particularly important for qualitative work).
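
Gaps of the kind described above are easier to investigate if they are flagged early. The sketch below lists periods where consecutive captured posts are further apart than a chosen threshold; the timestamps, the one-hour threshold and the `find_gaps` helper are hypothetical, and a flagged gap still needs checking against the possible causes listed on this slide.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive posts are further apart than max_gap."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap
    ]

# Hypothetical posting times for captured posts.
captured = [
    datetime(2017, 11, 1, 9, 0),
    datetime(2017, 11, 1, 9, 20),
    datetime(2017, 11, 1, 13, 5),
    datetime(2017, 11, 1, 13, 30),
]

for start, end in find_gaps(captured):
    print("No posts captured between", start, "and", end)
```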

Hidden pre-filtering and sampling
• Not all social media posts are equally likely to be included in standard API
endpoints
– e.g. a Twitter user with few posts and few followers is unlikely to appear on a
popular hashtag.
– The standard "Streaming" and "Search" APIs include 1% of Tweets and vary
in accuracy depending on activity/time etc. (See Morstatter et al 2013).
• Privacy settings will reduce the accuracy of any data sampled from
Facebook or other more complex privacy networks but it is hard to see
what is being excluded.
Morstatter, F., Pfeffer, J., Liu, H. and Carley, K.M., 2013. Is the Sample good enough?
Comparing data from Twitter's streaming API with Twitter's Firehose. In ICWSM 2013
and eprint arXiv:1306.5204. Available from: https://arxiv.org/abs/1306.5204

Question/Discussion
Have you already tried obtaining data for the
social media space you are using in your
research?
Have you faced any challenges or obstacles?

Existing Data Sets
• “The Zuckerberg files”: digital archive of all public comments by Mark Zuckerberg
including social media and mainstream media content for research use:
https://www.zuckerbergfiles.org/
• FiveThirtyEight Data: archive of data associated with FiveThirtyEight articles,
including social media data sets: https://github.com/fivethirtyeight/data
• Lumen database – tracking legal notices and complaints for removal of online
materials (including social media content): https://www.lumendatabase.org/
• CSIRO (Australia’s national science agency) We Feel – emotions in Tweets – API:
http://wefeel.csiro.au/#/api (see:
http://datadrivenjournalism.net/resources/we_feel)
• Stanford Large Network Dataset Collection - includes social network data
sets: https://snap.stanford.edu/data/
• Network Repository – network datasets, including social media, Facebook and
Twitter networks: http://networkrepository.com/
• DocNow – social justice social network archives: http://www.docnow.io/

Cross-site data tools
• North Carolina Social Media Archive
Toolkit: https://www.lib.ncsu.edu/social-media-archives-toolkit; see
also: https://github.com/NCSU-Libraries/Social-Media-Combine
• Social Mention (search engine for social media) API:
http://www.socialmention.com/api/
• Scrapebox (premium tool) YouTube Downloader:
http://www.scrapebox.com/youtube-downloader and Social Account
Scraper: http://www.scrapebox.com/social-account-scraper
• ESRC COSMOS Open Data Tools (available but no longer updated since
2014): http://socialdatalab.net/software
• Overview of Twitter data tools (Ahmed 2015):
http://blogs.lse.ac.uk/impactofsocialsciences/2015/07/10/social-media-research-tools-overview/

Recommended: DMI Tools
The Digital Methods Initiative adds new (documented) tools all the time, including:
• Censorship Explorer – determine censorship in various regions through URLs & proxies.
• Disqus Comment Scraper – obtain data from the Disqus comment plugin.
• Expand Tiny URLs – automatically expand large collections of Tiny URLs (e.g. from tweets).
• Geo IP – translate URLs or IP addresses into geographic locations (e.g. for a blog).
• Instagram Hashtag Explorer – retrieve Instagram media via specific hashtags.
• Issue Crawler – uses URLs to analyse relationships and connections through links between
URLs.
• Netvizz (Facebook) – extracts data from Facebook around groups, pages, search.
• Pinterest Scraper – scrapes Pinterest URLs and captures metadata of pins.
• Tumblr – data capture based on Tumblr tags, retrieving metadata and co-occurring tags.
• Twitter Capture and Analysis Toolset (DMI-TCAT) – robust and reproducible tool for data
capture and analysis of Twitter data. Source code available for local use.
• YouTube Data Tools – extract data on YouTube channels and videos, e.g. channel networks.
Access documentation and DMI tools at: https://wiki.digitalmethods.net/Dmi/ToolDatabase
See also, DMI Protocols: https://wiki.digitalmethods.net/Dmi/DmiProtocols

https://github.com/digitalmethodsinitiative/dmi-tcat/wiki

Internet Archive & WayBackMachine
• Global archive capturing websites (to various levels of detail/depth) based on IA
targets and user-submitted requests (since 2001).
• You can request a site for archiving, or a group of sites.
• Searchable resource OR can use exact URL to retrieve previous archived pages
(WayBackMachine).
• Collections exist for various social media topics, e.g:
– 2016 US Presidential Election Social Media: https://archive.org/details/2016electiontwitter
– Arab America on Social Media: https://archive.org/details/ArchiveIt-Collection-2797
– Gif Cities (Gifs from GeoCities): https://gifcities.org/
• Great for social media website changes, blogs, terms and conditions versions, etc.
• Sites available in a range of archive formats (IA), or as viewable pages
(WayBackMachine).
• See:
– https://archive.org/
– https://archive.org/web/
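
Retrieving archived pages by exact URL can be scripted. The sketch below only builds the addresses: one for viewing the capture closest to a date in the Wayback Machine, and one for the Wayback availability API, which returns JSON describing the closest capture. The helper names and the example page are illustrative assumptions; check the Internet Archive's own documentation before relying on either address format.

```python
from urllib.parse import urlencode

def wayback_snapshot_url(page_url, timestamp):
    """Build a Wayback Machine address for the capture closest to timestamp (YYYYMMDDhhmmss)."""
    return "https://web.archive.org/web/{}/{}".format(timestamp, page_url)

def wayback_availability_query(page_url, timestamp):
    """Build a query for the Wayback availability API (JSON about the closest capture)."""
    return "https://archive.org/wayback/available?" + urlencode(
        {"url": page_url, "timestamp": timestamp}
    )

# Illustrative example: an earlier version of a platform's terms page.
print(wayback_snapshot_url("https://twitter.com/tos", "20170101"))
print(wayback_availability_query("twitter.com/tos", "20170101"))
```

This can be useful for comparing versions of terms and conditions over the period in which data was collected.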

https://gifcities.org/?q=star+wars

UK Web Archive
• Run by the British Library (since 2004).
• Indexes (UK/related) sites to a greater depth than the Internet Archive.
• Smaller archive.
• You can request a site for archiving.
• Special Collections include:
– UK Blogs:
https://www.webarchive.org.uk/ukwa/collection/100698/page/1/source/collection
– London Terror Attacks, 2005 (mainstream and social media commentary):
https://www.webarchive.org.uk/ukwa/collection/100757/page/1/source/collection
– Olympic & Paralympic Games 2012 (mainstream and social media):
https://www.webarchive.org.uk/ukwa/collection/4325386/page/1/source/collection
• See: https://www.webarchive.org.uk/ukwa/.

Other Web Archive Resources
• Rhizome: archiving for internet art, including interactive works
engaging with/critiquing social media: http://rhizome.org/
• Note: EDINA are currently working on an archiving tool for
researchers, ask me for more info on Site2Cite.

Using APIs to obtain Data
• APIs (Application Programming Interfaces) exist for most
social media sites and allow direct requests for data.
• Some unofficial APIs exist for sites without official/open
APIs. Use only with caution as these frequently have
privacy, security or legal issues.
• Consider working with text and data mining colleagues, or
developers, to seek additional ways to capture data such
as:
– Screen scraping (automated capture of pages from a user
perspective).
– Mobile data collection or data capture approaches to social
media.
– Internet archiving approaches using standard tools or code
libraries

Glossary: Data Request Terms
• API: Application Programming Interface – a way to request data from a web service.
• REST or RESTful API: REST stands for “Representational State Transfer” and means an API that uses
HTTP (the protocol for accessing websites) requests (or “calls”) to:
– GET – read access to content such as posts, users, etc. This is the main request you would use to retrieve
data.
– PUT – update or replace data.
– POST – create new data (such as a post to a blog, a wiki page, etc.).
– DELETE – Delete content.
• An API Endpoint – is essentially the way to address and structure what kind of request you are
making. E.g. home_timeline vs user_timeline. Each endpoint provides a different entry to the data
behind a web service.
• In a REST GET request you may have:
– Fields – the various fields of data you want to retrieve, e.g. link, message, post, etc. These are usually shown
in the Developer Documentation.
– Modifiers or Parameters - these act like filters, limiting the request in a specific way, e.g. only retrieving
posts with a location attached.
– Operators – are the various standard terms/labels for content and content types that you can use in your
GET request to shape and customise it, for instance this might include “retweets_of” or “bio” or “has:links”
etc.
• Other types of APIs and M2M (Machine-to-Machine) interfaces exist including “SOAP” and “RPC”.
• SDK is Software Development Kit and is used increasingly often as a way to package various requests
for developers to use in web or mobile apps (SDK has been used as a term for the coding tools for
smartphone platforms iOS and Android for years).
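
To make the glossary concrete, the sketch below assembles a REST GET address from a base URL, an endpoint and some parameters. The service (api.example.org), the endpoint and the parameter names are hypothetical; each real platform documents its own.

```python
from urllib.parse import urlencode

def build_get_url(base_url, endpoint, **parameters):
    """Combine a base URL, an endpoint path and query parameters into one GET address."""
    query = urlencode(parameters)
    url = base_url.rstrip("/") + "/" + endpoint.lstrip("/")
    return url + ("?" + query if query else "")

# Hypothetical service and parameter names, for illustration only.
url = build_get_url(
    "https://api.example.org/v1",
    "posts/search",
    q="#openaccess",
    fields="message,link",
    has_location="true",
)
print(url)
```

Characters such as "#" and "," are percent-encoded automatically, which is why hand-typed request URLs with hashtags often fail.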

Locating or Requesting
Social Media Data
ProgrammableWeb (https://www.programmableweb.com/) is a great source for API
information for social media sites:
• Instagram Developer: https://www.instagram.com/developer/
– API Endpoints: https://www.instagram.com/developer/endpoints/
• Twitter Developer: https://developer.twitter.com/
– APIs: https://developer.twitter.com/en/docs
– GNIP: http://support.gnip.com/apis/ - premium "Firehose" access. See also Twitter
Enterprise: https://developer.twitter.com/en/enterprise
– Free APIs cover 7 days of tweets; Premium APIs exist for 30-day search and full archive search.
• Facebook for Developers: https://developers.facebook.com/
– API (Graph API): https://developers.facebook.com/docs/graph-api/
• YouTube Developers: https://developers.google.com/youtube/
– APIs (Comments and Comment Threads particularly useful):
https://developers.google.com/youtube/v3/docs/
• Weibo API: http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en

How do you make an API call?
• For open RESTful APIs you can enter an HTTP request in any browser
window, e.g. http://services.groupkt.com/state/get/USA/all
• Most social media APIs now require you to register your app, request a key,
and include the resulting access tokens in your request.
• In general API calls are made from within a small programme – this might
be running on your machine or from a browser based coding tool.
• Lots of existing tools based on social media APIs exist – see later slide for a
sample of these.
• Try it out:
– Codecademy Twitter API tutorial:
https://www.codecademy.com/en/tracks/twitter

An API Endpoint is a bit
like a vending machine…
Vending machine priced by grams of fat, Google, San
Jose, California.jpg by Flickr user Cory Doctorow.
You have to use the right machine to get hold of the item
you want, then you have to enter the right code and the
right price to get your candy.
• Each item has a name, and a standard way to access it
(in a vending machine this is the item code).
• Each item has a value (in a vending machine this is the
delicious edible contents of each item).
• Each item requires some sort of trust exchange before
you can access it (in a vending machine this is cash).
• In an API that “E12” item code is actually going to look
more like:
https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=twitterapi&count=2
• In an API the price is usually a key/access token that is
unique to you and your app – it indicates that the request
is legitimate and who it’s from.
Bonus: In APIs there is usually a huge range of data
(research candy) to ask for, and lots of filtering options.
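
To make the analogy concrete, here is a minimal sketch that assembles the "item code" (the endpoint plus its parameters) and the "price" (the access token) into a single request without sending it. The endpoint and parameters are the ones shown on this slide. The function name, the placeholder token and the use of a Bearer authorisation header are assumptions for illustration, as each platform's documentation defines the scheme it actually expects.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The "machine": the endpoint shown on the slide.
ENDPOINT = "https://api.twitter.com/1.1/statuses/user_timeline.json"


def build_timeline_request(screen_name, count, bearer_token):
    """Build (but do not send) a GET request for an account's recent tweets."""
    # The "item code": query parameters that select and filter the data.
    query = urlencode({"screen_name": screen_name, "count": count})
    # The "price": a token identifying you and your app (assumed Bearer scheme).
    headers = {"Authorization": f"Bearer {bearer_token}"}
    return Request(f"{ENDPOINT}?{query}", headers=headers, method="GET")


request = build_timeline_request("twitterapi", 2, "YOUR-TOKEN-HERE")
print(request.full_url)
```

Keeping the token out of the URL and out of any shared code is a sensible habit, since anyone holding it can make requests in your app's name.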

What will you get back from
an API GET request?
Assuming it has worked correctly, something like this…
See the full example at:
https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html
Each of these is a new field for a single tweet and its value.
[] is an empty field (e.g. no hashtag on this tweet).

This data can then be processed by your app, or simply
retrieved and stored in a database or spreadsheet…
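
One possible way to move from a JSON reply to spreadsheet-ready rows is sketched below. The sample response is invented and heavily abbreviated (a real reply carries many more fields per tweet), and the function names are hypothetical.

```python
import csv
import io
import json

# Invented, heavily abbreviated response: one JSON object per tweet.
SAMPLE_RESPONSE = """
[
  {"id_str": "1001", "created_at": "Mon Oct 02 10:00:00 +0000 2017",
   "text": "Example tweet with a tag",
   "user": {"screen_name": "example_user"},
   "entities": {"hashtags": [{"text": "ethics"}]}},
  {"id_str": "1002", "created_at": "Mon Oct 02 10:05:00 +0000 2017",
   "text": "Example tweet without a tag",
   "user": {"screen_name": "another_user"},
   "entities": {"hashtags": []}}
]
"""


def tweets_to_rows(raw_json):
    """Flatten each tweet object into a simple dictionary, one per row."""
    rows = []
    for tweet in json.loads(raw_json):
        hashtags = [tag["text"] for tag in tweet["entities"]["hashtags"]]
        rows.append({
            "id": tweet["id_str"],
            "created_at": tweet["created_at"],
            "screen_name": tweet["user"]["screen_name"],
            "text": tweet["text"],
            "hashtags": " ".join(hashtags),  # empty string when the list is []
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = tweets_to_rows(SAMPLE_RESPONSE)
print(rows_to_csv(rows))
```

Choosing the columns explicitly, rather than dumping every field, also makes it easier to see exactly what personal data ends up in the stored file.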

Recommended Tool:
Martin Hawksey’s TAGS
• Uses Google Docs to capture tweets based on a hashtag, search term, user, etc.
• Can be automated to allow rolling capture.
• Useful for capturing a sample of long term community dialogues or public
discourse where Top Tweets/7 day limits will be acceptable.
• Includes spreadsheet; visualisation; searchable archive - latter two options are
only available if you make data (semi) public.
• Uses Twitter API – takes “Top” rather than “Latest” tweets so accuracy depends on
popularity of content/hashtags.
• Well documented and supported by Martin.
• A great way to dip your toe in the API water – you have to obtain a key the first
time you run TAGS, and can access and look at the code it runs. You can also make
more advanced use of the tool and automation connecting it to other
visualisations and analysis tools.
• See: https://tags.hawksey.info/
• Support: https://tags.hawksey.info/forums/

Question/Discussion
Do you have any experience or
recommendations for social media data
collection tools or approaches?
Have you attended one of the Digital Scholarship
sessions where CAHSS researchers can meet
with developers and data specialists?
[Recommended!]

Analysis & Visualisation
Further information, tutorials etc. online and/or running through Digital Scholarship and Schools Research Methods training.
• NVivo (http://www.qsrinternational.com/nvivo/) – Premium qualitative data analysis software with social media and multimedia
support, collaborative working also supported. Feature rich. Training available. Available through UoE/CAHSS license:
https://www.ed.ac.uk/information-services/computing/desktop-personal/software/main-software-deals/nvivo
• IBM SPSS (https://www.ibm.com/analytics/us/en/technology/spss/) – Premium data analysis tool for surveys and particularly for
quantitative data, widely used in social sciences. Available through UoE license:
https://www.ed.ac.uk/information-services/computing/desktop-personal/software/main-software-deals/spss
• Dedoose (http://www.dedoose.com/) – Premium qualitative data analysis software with simple interface, tagging, annotation and
exploration options.
• Chorus (http://chorusanalytics.co.uk/) – Free software for data harvesting and analytics for social science research using Twitter data
• Gephi (https://gephi.org/) – Visualisation and exploration of multiple data types, particularly good for network analysis. Feature rich so
a bit of a learning curve. Free download.
• D3 visualisation libraries (see: https://github.com/d3/d3/wiki/gallery) – Free collection of JavaScript libraries for use in data
visualisation and exploration of multiple data types.
• NodeXL (https://nodexl.codeplex.com/) – Network visualisation tool for Excel. Free.
• TAGS Explorer (https://tags.hawksey.info/) – Twitter-only visualisations of networks (using NodeXL) and searchable timeline archive
explorations. Free.
• Textal (http://www.textal.org/) – Text analysis tools for mobile use with Twitter streams, websites (inc. blogs), and documents. Free.
• Tableau (https://www.tableau.com/) – Visualisation of multiple data sources and types. Free trial, otherwise monthly subscription.
A large number of open source tools are also available. Search for these or look at the Journal of Open Research Software
(https://openresearchsoftware.metajnl.com/) or the Journal of Open Source Software (http://joss.theoj.org/) for well documented
research-driven examples. See also Tony Hirst’s OU Useful Blog (https://blog.ouseful.info/) for visualisation approaches. There are also many
marketing packages for social media analysis which could be used/adapted for research where their processes are well documented.

Appropriate Handling & Storage
• Data is usually returned with unique identifiers that can be easily
traced back to the original poster/subject.
• The unique identifiers connect conversations and posts so are hard
to strip away entirely – although you could try a one-way hash of
the data to mask the identifiable information but retain
connections.
• Short posts and tweets are highly identifiable. Try Googling or
searching Twitter for a recent tweet to see that in action.
• Images and videos can also be relatively easily compared/reverse
image searched and therefore identifiable.
• Think about which fields you actually need to retain for your
research question(s).
• Plan how long you will keep your data, and how you will keep it
secure - where and how you store your data really matters.
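
The one-way hash idea above might be sketched as follows. This version swaps in a keyed hash (HMAC-SHA256) rather than a plain hash, on the assumption that account identifiers are guessable and a plain hash could be reversed by hashing candidate IDs. The field names, the sample posts and the secret key are all hypothetical.

```python
import hashlib
import hmac

# Assumption: a secret key stored separately from the data and never published.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymise(identifier, secret_key=SECRET_KEY):
    """Return a stable, keyed one-way hash (HMAC-SHA256) of an identifier."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, secret_key=SECRET_KEY):
    """Replace identifying fields with pseudonyms, leaving other fields as-is."""
    masked = dict(record)
    for field in ("user_id", "in_reply_to_user_id"):
        if masked.get(field) is not None:
            masked[field] = pseudonymise(masked[field], secret_key)
    return masked


posts = [
    {"user_id": "12345", "in_reply_to_user_id": None, "text": "Original post"},
    {"user_id": "67890", "in_reply_to_user_id": "12345", "text": "A reply"},
]
masked_posts = [mask_record(post) for post in posts]

# The reply still points at the (pseudonymised) author of the original post.
print(masked_posts[1]["in_reply_to_user_id"] == masked_posts[0]["user_id"])
```

Because the same input always yields the same pseudonym, reply and mention networks survive the masking. Note that the text field is left untouched here, so the post itself may still be searchable, as the bullets above warn.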

Data Protection & GDPR
• Be aware of current Data Protection (Data Protection Act 1998) guidance
on the use, storage and retention of personal data.
• From 25th May 2018 the General Data Protection Regulation (GDPR)
comes into effect with:
– Increased rights for individuals: to be informed about how their data is used, to access, rectify
and erase it, to restrict processing, to data portability, and to object to the use of their data.
– Increased legal measures for organisations in breach of the GDPR.
• Ensure your Consent process, your Research Data Management plans, and
your use, access and disposal of data is compliant.
• By default social media APIs provide a lot of data:
– What is the minimum data you require?
– Removing unneeded data at the point of collection and/or data cleaning will
help reduce any risks of exposure or non-compliance with data protection
legislation.
See:
• Data Protection Act 1998: https://www.legislation.gov.uk/ukpga/1998/29/contents
• ICO guidance: https://ico.org.uk/for-organisations/data-protection-reform/overview-of-the-gdpr/
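
Removing unneeded data at the point of collection could be as simple as the sketch below, which keeps only a named set of fields from each record before anything is stored. The list of fields to keep and the sample record are hypothetical, since the right minimum depends on your research question.

```python
# Hypothetical choice for a study that only needs timing, content and language.
FIELDS_TO_KEEP = ("created_at", "text", "lang")


def minimise(record, keep=FIELDS_TO_KEEP):
    """Return a copy of the record containing only the fields named in keep."""
    return {field: record[field] for field in keep if field in record}


raw_record = {
    "id_str": "1001",
    "created_at": "Mon Oct 02 10:00:00 +0000 2017",
    "text": "Example post",
    "lang": "en",
    "user": {"screen_name": "example_user", "location": "Edinburgh"},
    "coordinates": None,
}

stored_record = minimise(raw_record)
print(stored_record)
```

Working from a list of fields to keep, rather than a list of fields to drop, means any new field a platform adds later is excluded by default.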

Local Support
• Research Data Mantra – self-led course on Research Data Management,
including appropriate handling, storage and planning for onward
preservation, sharing or destruction: http://mantra.edina.ac.uk/
• Data Store – secure storage for active research data, available to all staff
and PGR students:
https://www.ed.ac.uk/information-services/research-support/research-data-service/working-with-data/data-storage
• Working with Sensitive Data – guidance and further resources on working
with sensitive and personal data:
https://www.ed.ac.uk/information-services/research-support/research-data-service/working-with-data/sensitive-data
• Information Security Team – guidance on legal and technical approaches
to keeping data secure and appropriately encrypted and disposed of:
https://www.ed.ac.uk/infosec

Making Research Data Open
• If you have a consent process in place, ensure you request consent for any onward use you
expect to make of your data. And ensure there is a process to withdraw consent for onward use.
• Beware verbatim quoting in publications – it can be easy to search back to the original text.
– Public figures who would consider their social media content a publication and part of their profile
(e.g. politicians) are more appropriate to quote, where needed.
– Even if anonymous/not attributed it is safer to paraphrase short comments where possible to make
reverse searching more challenging.
• Screenshots of posts often reveal the subject name, image, location, and their contacts. Only
use these where appropriate, properly consented to, and where you are not placing your
subjects at risk.
• Consider the time lag between data collection and any publication. Is your consent from
participants still valid if a year has passed? What about 2 years? Or 5 years? A teen
participant may feel differently about data being exposed when they are, for instance, a
newly qualified lawyer or medic with very different reputational considerations.
See also: University of North Carolina at Chapel Hill and UoE Research Data Management and Sharing
(Coursera): https://www.coursera.org/learn/data-management

Courses and Information
• DMI Digital Methods online course:
https://wiki.digitalmethods.net/Digitalmethods/WebHome
• UCL Why We Post: the Anthropology of Social Media course (FutureLearn):
https://www.futurelearn.com/courses/anthropology-social-media
• QUT Social Media Analytics: Using Data to Understand Public
Conversations (FutureLearn):
https://www.futurelearn.com/courses/social-media-analytics
• Rutgers University Social Media Data Analytics (Coursera):
https://www.coursera.org/learn/social-media-data-analytics
• Doing Journalism with Data: First steps, skills and tools:
http://learno.net/courses/doing-journalism-with-data-first-steps-skills-and-tools
• UoE Digital Footprint MOOC – understand some of the challenging
identity, privacy and ethical concerns around social media for you and
your research subjects: https://www.coursera.org/learn/digital-footprint/

Useful Niche Resources
• Utrecht Data School Data Ethics Decision Aid (DEDA):
https://dataschool.nl/research/deda/?lang=en
• Programming Historian Data Mining the Internet Archive lesson:
https://programminghistorian.org/lessons/data-mining-the-internet-archive
• Insight News Lab Social Network Analysis and Visualisation for #RDAPlenary 3
(using ScraperWiki and OpenRefine): http://hujo.deri.ie/rdaplenarysn/
• Tony Hirst First Baby Steps to Anonymising Data with Open Refine:
https://blog.ouseful.info/2015/01/23/anonymising-data-with-open-refine/
• Tony Hirst Social Interest Positioning – Visualising Facebook Friends’ Likes with
Data Grabbed Using Google Refine:
https://blog.ouseful.info/2012/01/04/social-interest-positioning-visualising-facebook-friends-likes/
• Tony Hirst Grabbing Twitter Search Results into Google Refine and Exporting
Conversations into Gephi (needs updating for new Twitter API):
https://blog.ouseful.info/2012/10/02/grabbing-twitter-search-results-into-google-refine-and-exporting-conversations-into-gephi/

Local research and expertise
(a small sampling thereof!)
• Social media, Digital Ethnography, Sociological research methods – Kate Orton-Johnson (Sociology).
• Social Media, Digital Labour – Karen Gregory (Sociology).
• Communities on the Darknet, illicit markets and cultures – Angus Bancroft (Sociology).
• Social media in education; bots; anonymity in social media – Sian Bayne (Research in Digital
Education Centre, Moray House).
• Digital cultural heritage learning and engagement – Jen Ross (Research in Digital Education Centre,
Moray House); Claire Sowton (CAHSS); Melissa Terras (UCL/CAHSS).
• Text and data mining of social media content – Claire Grover (Informatics); Richard Tobin
(Informatics); Clare Llewellyn (Informatics; Neuropolitics Research, SPS).
• Sharing of photography, autobiographical memory and distributed cognition (inc. social media) –
Tim Fawns (Clinical Education, Centre for Medical Education, MVM).
• Big data (inc. social media) in healthcare – Mhairi Aitken (Usher Institute, MVM).
• Social media, Digital Footprint, blogging and Buddhism – Louise Connelly (Vet School).
• Mobility, mobile technology, formal and informal education communities around the world –
Michael Sean Gallagher (Research in Digital Education Centre, Moray House).
• Playful learning in informal digital environments – Clara O’Shea (Research in Digital Education
Centre, Moray House).
• Social media and politics – Neuropolitics Research group: Laura Cram (Politics, SPS); Robin Hill
(Informatics; SPS); Sujin Hong (SPS); Adam Moore (PPL).
• Visualisation of big data, including network analysis – Benjamin Bach (Design Informatics).
• Social Media and scholarly communities – Sara Shinton (IAD); James Stewart (SPS).

Recommended work
& groups researching in this area
• UoE Beyond Text Network – interdisciplinary network for social media and multimedia researchers:
https://www.wiki.ed.ac.uk/display/DIG/Beyond+Text
• UoE Informatics Language Technology Group – text mining expertise working on projects including topic modelling and
social media analysis: https://www.ltg.ed.ac.uk/
• Digital Methods Initiative (DMI) (European multi-organisation research group):
https://wiki.digitalmethods.net/Dmi/DmiAbout
• Microsoft Research Social Media Collective (US) – particularly danah boyd, Nancy Baym and Kate Crawford’s work:
https://www.microsoft.com/en-us/research/group/social-media-collective/
• #NSMNSS: New social media, new social science? - great blog reflecting on social science methods around social
media http://nsmnss.blogspot.co.uk/
• Oxford Internet Institute – particularly strong on relationships to mainstream media environment: https://www.oii.ox.ac.uk/
• Visual Social Media Lab (Sheffield) – led by Farida Vis: http://visualsocialmedialab.org/
• DocNow – social justice social media archiving: http://www.docnow.io/
• Data Driven Journalism (European Journalism Centre and Netherlands): http://datadrivenjournalism.net/
• Analysing Social Media Collaboration (UK cross-institution group, site now dormant) – responsible for the high profile
“Reading the Riots” Twitter analysis work in 2011: http://www.analysingsocialmedia.org/home
• Michael Zimmer – influential work on privacy, leading projects on privacy and Facebook: http://www.michaelzimmer.org/
• Electronic Frontier Foundation – advocates with expertise on privacy and tracking in social media: https://www.eff.org/
• Centre for Social Media Research (University of Westminster): https://www.westminster.ac.uk/social-media-research
• Digital Media and Society Research Group (Cardiff):
https://www.cardiff.ac.uk/research/explore/research-units/digital-media-and-society
• COSMOS (legacy page for Cardiff research group): http://www.cs.cf.ac.uk/cosmos/

Recommended Journals
• First Monday (University of Illinois at Chicago): http://firstmonday.org/index
• New Media & Society (Sage): http://journals.sagepub.com/home/nms
• Information, Communication & Society (Taylor & Francis):
http://tandfonline.com/toc/rics20/current
• Social Media + Society (Sage): http://journals.sagepub.com/home/sms
• Big Data & Society (Sage): http://journals.sagepub.com/home/bds
• Policy & Internet (Wiley):
http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1944-2866
• Journal of Computer-Mediated Communication (Wiley):
http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1083-6101
• Cyberpsychology, Behavior, and Social Networking (Mary Ann Liebert Inc.):
http://online.liebertpub.com/loi/CYBER
• Journal of Broadcasting & Electronic Media (Taylor & Francis):
http://www.tandfonline.com/toc/hbem20/current

Relevant Upcoming
Digital Scholarship Sessions
• Digital Research Clinics and Resources (26th October 2017)
• Cleaning Data with Open Refine (1st November 2017)
• Regex: Regular Expressions (23rd November 2017)
• Introduction to Sentiment Analysis: What it is and how to do it simply
(14th December 2017)
Look out for further sessions and/or contact the team with any specific
requests: http://www.digital.cahss.ed.ac.uk/

Questions & Discussion
Or follow up after today: nicola.osborne@ed.ac.uk

Editor's Notes