Working with Social Media Data: Ethics & good practice around collecting, using and storing data

Slides from a workshop delivered for the University of Edinburgh Digital Scholarship programme, on 18th October 2017. For further information on the programme see: http://www.digital.cahss.ed.ac.uk/ or #DigScholEd. If you are interested in hosting a similar workshop, or adapting these slides please contact me: nicola.osborne@ed.ac.uk.

Published in: Education

  1. 1. Working with Social Media Data: Ethics & Good Practice around Collecting, Using and Storing Data Nicola Osborne Digital Education Manager, EDINA Nicola.osborne@ed.ac.uk @suchprettyeyes
  2. 2. Introductions: my social media work • Digital Education Manager at EDINA, University of Edinburgh. • Work on EDINA’s educational technology, innovation, digital and data projects for audiences across Scotland, UK and further afield. • Co-I on: PTAS-funded Managing Your Digital Footprints research strand (2014-2015); Ongoing (2015-) Managing Your Digital Footprint research team; PTAS-funded “A Live Pulse”: Yik Yak for understanding teaching, learning and assessment at Edinburgh project. • Co-tutor on ongoing Digital Footprint MOOC (2017-) • Previously EDINA Social Media Officer (2009-2015), providing expertise and advice on social media to colleagues across UoE for over 8 years. http://edina.ac.uk/
  3. 3. Introduction: you and your work 1. Who are you? 2. What social media related research are you working on or hoping to work on? 3. What do you hope to get out of today’s session?
  4. 4. Overview • Introduction & Design Considerations – Approach – Data accuracy • Ethical Considerations – Recommended ethical guidance – Terms & Conditions – and impact on Data – Consent and trust • Practical Considerations – Existing data sets – Available data tools – APIs – Options for analysis and visualisation • Storing and handling Data – Compliance with legal requirements – Sources of support • Recommended researchers, groups, and resources. • Q&A/Discussion – but questions welcome throughout!
  5. 5. Where to start… • What is your research question(s)? • Are social media or social media communities the subject, or core to the subject? • Or, is it the space for recruitment or reaching an audience? • Or, is it just a convenient space for data collection?
  6. 6. The Elephant (Blue Bird) in the Room Image ©Twitter.com 2012
  7. 7. Research Design Considerations • Research approach to be taken • Appropriate data types to support your research – Streaming/live data OR – Archived / capture of data over time with asynchronous analysis • Ethical considerations • Consent process of subjects and their network • Etiquette considerations • Platform(s) to be used – Fit with target subjects – Terms & Conditions • Practical access limitations e.g. – Do tools for data capture exist? – Does an API exist? – What are the API limitations? – Costs of access • Your (researcher) or your RAs’ expertise. • Long term research vision – do you have rights to use and reuse data in the ways you hope to?
  8. 8. Possible Methods & Questions to Think About • Computational (See also Batrinca and Treleaven 2015): – Data access through APIs, screen scraping, established methods (e.g. DMI tools)? – Text and data mining and/or Natural Language Processing (NLP)? – Social network analysis and/or Actor Network Theory (ANT) analysis using nodes and edges in the network? – Sentiment analysis based on text mining/NLP or based on presence/absence of emojis and/or visual content? – Visual analysis and/or video or audio analysis for multimedia content? • Quantitative (See also OII 2013a, b & c): – Medium or large scale data? – Automated or survey/volunteered data collection? – Data cleansing process – how will you ensure that you have a good quality data set? – What kind of statistical analysis do you want to undertake? Tools might include SPSS, NVivo, Gephi, Tableau, etc. – Will you be comparing to existing data sets and/or undertaking trend analysis over time? – What standard tools in your field – for digital or non-digital data – can you use to collect or interpret your data? • Qualitative: – Manual collection? – Ethnographic approaches and/or participant observation? – Focus groups or similar? – Critical/reflexive reading and coding of texts/content? Batrinca, B. and Treleaven, P.C., 2015. Social Media Analytics: a survey of techniques, tools and platforms. In AI & Society, 30 (1), pp. 89-116. https://doi.org/10.1007/s00146-014-0549-4 Oxford Internet Institute, 2013a. Quantitative Methods in Social Media Research: Big Data. In OII YouTube Channel, 15 March 2013. See: https://www.youtube.com/watch?v=6hmjj7_1sSY Oxford Internet Institute, 2013b. Quantitative Methods in Social Media Research: Populations and Sampling. In OII YouTube Channel, 15 March 2013. See: https://www.youtube.com/watch?v=6hmjj7_1sSY Oxford Internet Institute, 2013c. Space-Time as a Sampling Condition for New Media Research. In OII YouTube Channel, 15 March 2013. See: https://www.youtube.com/watch?v=HNxn0PqOc8k
  9. 9. Is Social Media Data Representative? • Not all people use social media (and some of the least privileged groups in society are not online at all). • Most social media data collection methods favour English language data in mainstream US/Global sites. It is unusual to see multilingual research or research that acknowledges use of content including non-English text by primarily English speakers. • Privacy settings and publicness tend to reflect status and privilege. Accessing at-risk, vulnerable, heavily trolled, and/or niche interest groups is more difficult than obtaining public posts from middle class white male social media users. BAME communities, women’s groups, LGBTQ+ communities, etc. tend to make higher use of private groups, group moderation, and protective measures that require more qualitative and overt consent-based approaches. • Not all social media users are active. There is an “activity and agency bias” (Lutz and Hoffmann 2017) in much of the current research. Obtaining data on passive reads and engagement with content is extremely difficult through quantitative methods. It may be easier with participant observation. Lutz, C. and Hoffmann, C. P. 2017. The dark side of online participation: exploring non-, passive and negative participation. In Information, Communication & Society: AoIR Special Issue, 20 (6), pp. 876-897. http://dx.doi.org/10.1080/1369118X.2017.1293129
  10. 10. Question/Discussion Which platform(s) are you intending to/are you working with? How did you select these social media spaces?
  11. 11. Ethical Considerations • Visibility vs expectations of privacy: – Being “in public” is not consent to being researched; a user’s imagined audience may be quite different. (see AoIR guidance, Marwick and boyd 2011) – Are you engaging with private or “public” figures? Expectations over visibility will vary significantly. • How possible is it to obtain informed consent for work undertaken with your chosen social media platform? How can consent be withdrawn? • How will your data be collected and used? (Attributed vs Pseudonyms vs Anonymous). • What personal data is being used? Does it put anyone at risk? • What is the risk of accidental exposure or re-identification? Text snippets, quotes and images may all be easily searchable. • Public – or previously public – data can change in sensitivities over time. • How will you handle/remove/retain subsequently deleted content? Marwick, A. and boyd, d., 2011. I tweet honestly, I tweet passionately: Twitter users, context collapse, and the imagined audience. In New Media & Society, 13 (1), pp. 114-133. DOI: 10.1177/1461444810365313.
  12. 12. Recommended: AoIR Ethics Guidance • AoIR Ethics Guidance (2012): https://aoir.org/reports/ethics2.pdf • AoIR Ethics Chart – a quick guide to key issues: https://aoir.org/aoir_ethics_graphic_2016/ • AoIR Ethics Guidance (2002): https://aoir.org/reports/ethics.pdf • Annette Markham (co-author of AoIR guidance) on Impact Models for ethical decision making in data research and design: https://annettemarkham.com/2017/07/impact-model-ethics/
  13. 13. Recommended: Social Media Research: A Guide to Ethics • Excellent concise research ethics guidance from the ESRC-funded “Social Media, Privacy and Risk: Towards More Ethical Research Methodologies” project at University of Aberdeen. • Includes pointers to further social media ethics resources. • Townsend, L. and Wallace, C. 2016. Social Media Research: A Guide to Ethics. Aberdeen: University of Aberdeen/ESRC Social Media Enhancement project. Available from: http://www.dotrural.ac.uk/social-media-research-ethics/
  14. 14. “But the data is already public” In 2008 researchers released profile data (The T3 Data Set) from Facebook accounts of students at a US University, inadvertently making identifiable data public, as reported in Zimmer (2010). In this case the researchers: • Had employed RAs who were part of the Network being examined and had (various levels of) access to more information than a non-logged-in user of Facebook/user beyond the Network. • Had funding that mandated open publishing and sharing of results. • Had the University’s consent, but not individuals’ consent, for data collection. • Combined Facebook with university housing data in their data sets. • Obscured the identity of the university where students were based, but described key characteristics. • Attempted to make all data anonymous by removing identifying information (name, student id, etc.) but left network and behavioural information intact. • Asked other researchers using the data not to attempt to reidentify subjects. • Stated that “hackers” and “extreme effort” would be the only way to “crack” the data. The university was identified swiftly based purely on the codebook and other writings about the data, without requiring direct access to the data. Once the university was identified, other specific identifying data (nationality, race, home state, etc.), sometimes with only 1 individual in these groups, made re-identification of (some) students simple. After public scrutiny and identification of the university, the data set was swiftly withdrawn by the researchers. Zimmer, M. 2010. “But the data is already public”: on the ethics of research in Facebook. In Ethics and Information Technology, 12 (4), December 2010, pp. 313-325. https://link.springer.com/article/10.1007%2Fs10676-010-9227-5
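The re-identification risk described above (groups containing "only 1 individual") can be checked before any release. The following is a minimal sketch, not the method used by Zimmer or the T3 researchers: the records, field names and the choice of quasi-identifiers are all invented for illustration.

```python
from collections import Counter

# Hypothetical, invented records: names are removed but quasi-identifiers remain.
records = [
    {"nationality": "US", "home_state": "MA", "major": "History"},
    {"nationality": "US", "home_state": "MA", "major": "History"},
    {"nationality": "US", "home_state": "NY", "major": "Physics"},
    {"nationality": "FR", "home_state": None, "major": "Physics"},
]

# Assumed set of fields that could identify someone when combined.
QUASI_IDENTIFIERS = ("nationality", "home_state", "major")


def group_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys):
    """Return the rows whose combination of values appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


at_risk = unique_rows(records, QUASI_IDENTIFIERS)
print(f"{len(at_risk)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
```

Any record that is unique on the chosen fields is a candidate for re-identification once outside information (such as the name of the university) becomes known, so it may need generalising or withholding.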
  15. 15. Terms & Conditions • Before undertaking any social media research understand the T&Cs and Developer T&Cs for the platform(s) you are looking at. • Understand how your research aligns with the T&Cs, and any possible issues of privacy, etiquette, or practical access. • If your work is in conflict with T&Cs either re-design your research (strongly recommended) or look carefully at risks and impacts. • You should not ignore any T&Cs for technical reasons. If there is a valid reason to ignore T&Cs for specific research reasons (such as research on deleted tweets), be prepared to justify that to ethics boards and peer reviewers. And understand that you may risk losing access to the platform and your research data if you are found to be in breach of T&Cs.
  16. 16. Twitter Developer T&Cs of note (1) Section VII (Other Important Terms), A: User Protection: "Twitter Content, and information derived from Twitter Content, may not be used by, or knowingly displayed, distributed, or otherwise made available to:"… "any entity for the purposes of conducting or providing surveillance, analyses or research that isolates a group of individuals or any single individual for any unlawful or discriminatory purpose or in a manner that would be inconsistent with our users' reasonable expectations of privacy;" https://developer.twitter.com/en/developer-terms/agreement-and-policy
  17. 17. Twitter Developer T&Cs of note (2) Section VII (Other Important Terms), C: Respect Users' Control and Privacy: "3. If Content is deleted, gains protected status, or is otherwise suspended, withheld, modified, or removed from the Twitter Service (including removal of location information), you will make all reasonable efforts to delete or modify such Content (as applicable) as soon as reasonably possible, and in any case within 24 hours after a request to do so by Twitter or by a Twitter user with regard to their Content." https://developer.twitter.com/en/developer-terms/agreement-and-policy
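In practice the deletion clause above means a stored data set needs a routine for removing content that has since been deleted. A minimal sketch follows, assuming posts are held locally as dictionaries keyed by an `id_str` field and that a list of deleted IDs has already been obtained somehow; both the storage format and the example data are invented.

```python
def apply_deletions(stored_posts, deleted_ids):
    """Return the stored posts with every post listed in deleted_ids removed."""
    deleted = set(deleted_ids)
    return [post for post in stored_posts if post["id_str"] not in deleted]


# Invented example data standing in for a locally stored collection.
stored = [
    {"id_str": "1001", "text": "first invented post"},
    {"id_str": "1002", "text": "second invented post"},
    {"id_str": "1003", "text": "third invented post"},
]

# Hypothetical list of IDs reported as deleted since collection.
remaining = apply_deletions(stored, ["1002"])
print([post["id_str"] for post in remaining])
```

How the deleted IDs are discovered (for example by re-requesting stored posts and noting which no longer return) depends on the platform and is outside this sketch.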
  18. 18. Facebook Statement of Rights & Responsibilities Section 5: Protecting Other People's Rights "We respect other people's rights, and expect you to do the same. 1. You will not post content or take any action on Facebook that infringes or violates someone else's rights or otherwise violates the law. 2. We can remove any content or information you post on Facebook if we believe that it violates this Statement or our policies. 3. We provide you with tools to help you protect your intellectual property rights. To learn more, visit our How to Report Claims of Intellectual Property Infringement page. 4. If we remove your content for infringing someone else's copyright, and you believe we removed it by mistake, we will provide you with an opportunity to appeal. 5. If you repeatedly infringe other people's intellectual property rights, we will disable your account when appropriate. 6. You will not use our copyrights or Trademarks or any confusingly similar marks, except as expressly permitted by our Brand Usage Guidelines or with our prior written permission. 7. If you collect information from users, you will: obtain their consent, make it clear you (and not Facebook) are the one collecting their information, and post a privacy policy explaining what information you collect and how you will use it. 8. You will not post anyone's identification documents or sensitive financial information on Facebook. 9. You will not tag users or send email invitations to non-users without their consent. Facebook offers social reporting tools to enable users to provide feedback about tagging." https://www.facebook.com/terms.php
  19. 19. Trust in Social Networks vs Trust in Research Research Ethics – Randall Munroe/xkcd (https://xkcd.com/1390/) Licensed under CC-BY-NC 2.5 Trust in social media networks is mixed, with users increasingly savvy about data use… However… • Social Media users can find observation by academic researchers more disconcerting than by the companies who own the platforms. • Research, depending on the topic, can feel like a judgement on behaviours making consent hugely important. • The burden on researchers to be clear about motives, funders, process, etc. is higher than on commercial companies. • There are parallels here to how individuals feel about e.g. Tesco Clubcard or Credit Card data capture vs. surveys and censuses.
  20. 20. Question/Discussion What are the ethical concerns and considerations for your current (or previous) social media research?
  21. 21. Obtaining Consent • Consent may be implicitly included for API data access in some terms and conditions BUT, when did you last read the terms and conditions? What about your research participants? So: • Obtain explicit consent wherever possible. • Be transparent if you are engaging in research in a space – with a pinned post, link to your participant information sheet, etc. • Consent can be tricky in anonymous and less traditional social media spaces (see e.g. Osborne 2017 for approaches used with Yik Yak). • Apply particular caution to gaining consent for screen shots, attributed posts, reproducing exact images or text of posts etc. Osborne, N. 2017. Addressing ethics of research in anonymous online spaces. In “A Live Pulse”: Yik Yak for understanding teaching, learning and assessment at Edinburgh [blog], 13th July 2017. http://yikyakresearch.blogs.edina.ac.uk/2017/07/13/addressing-ethics-of-research-in-anonymous-online-spaces/
  22. 22. Some Common Ethics Pitfalls • Researcher assumes public data can be used in any way desired, without considering the subject(s) intent when originally sharing their profile/post etc. • Researcher explores conveniently available “public” data without realising that privacy settings may make more information available to them, than is truly “public”. • Researcher is using “big” data under belief that individuals will not be identifiable (as in the “But the data is already public” case). • Research subject(s) has shared data on a public site but is not aware of their own settings, or has not checked them lately, making implicit consent and the public nature of the data problematic. Discovering that they have been included in published research may be upsetting and problematic. • Research Ethics Committees and/or Journal Editorial Boards are unaware or do not properly consider that social media data includes real names, pseudonyms, locations, highly disclosive data and do not ask the right questions around the consent process, collection, aggregation, storage and retention of data. • Researcher uses full text of a post as an “anonymous” example but this is then Googled which identifies the original post/tweet/content and individual.
  23. 23. Data Considerations • What kind of research approach are you taking? • Who or what is the subject of your research – what is the right social media space to capture appropriate data? • What scale of data are you looking to collect/harvest? (If working with big data see boyd & Crawford 2012) • Will you be sampling or looking to collect all data over a specific time period? • How sensitive is the topic? • What level and type of consent can you obtain from participants? • What kind of content? – Profiles – for network analysis, image analysis, qualitative review of content through profile components/data? – Posts – through API/data feed/harvesting or observation? Textual, visual, multimedia? Manual coding or text/data mining? – Comments/discussion – contents or threads of discussion? – Metadata – tags, likes, engagements? • Time bounds – how long do you expect to collect data for? • What use will you make of the data after capture? boyd, d. and Crawford, K., 2012. Critical questions for big data. In Information, Communication & Society special issue: A decade in internet time: the dynamics of the internet and society, 15 (5). http://dx.doi.org/10.1080/1369118X.2012.678878
  24. 24. Sources of baseline data on usage, access, trends, literacies etc. • Oxford Internet Surveys: biennial data on UK public use and attitudes to the internet, including social media: http://oxis.oii.ox.ac.uk/research/dataset-request/ • Ofcom research and data: Regular reporting on UK public use and attitudes to media, including internet and social media: https://www.ofcom.org.uk/research-and-data/search. Includes: – Annual adult media use and attitudes, and children’s media literacy reporting: https://www.ofcom.org.uk/research-and-data/media-literacy-research; – Communications Market Report: annual overview of consumer use of communications of all types: https://www.ofcom.org.uk/research-and-data/multi-sector-research/cmr – Further regular and one-off data via the statistical release calendar: https://www.ofcom.org.uk/research-and-data/data/statistics • Pew Internet & American Life datasets: data on US public use, knowledge and understanding of the web, digital literacy, social media, etc: http://www.pewinternet.org/datasets/. For example: – Social Media Update 2016: http://www.pewinternet.org/2016/11/11/social-media-update-2016/
  25. 25. Sources of Official Social Media Usage Data, Trends, Financials, etc. Best sources are quarterly earnings reports and presentations, typically including: monthly active users, usage trends, earnings, monetization strategies, financials, future plans: • Facebook & Instagram & WhatsApp: https://investor.fb.com/home/default.aspx • Twitter: https://investor.twitterinc.com/results.cfm • SnapChat: https://investor.snap.com/events-and-presentations/events • YouTube/Google via Alphabet: https://abc.xyz/investor/ • Flickr: – Currently owned by Oath, should be via Verizon once deal closes: http://www.verizon.com/about/investors – Historical up to 2017, via Yahoo captures in the Internet Archive: https://web.archive.org/web/*/https://investor.yahoo.net/index.cfm • LinkedIn: – Current via Microsoft: https://www.microsoft.com/en-us/investor/ – Historical up to 2016: https://news.linkedin.com/topic/earnings • Weibo: http://ir.weibo.com/phoenix.zhtml?c=253076&p=irol-irhome
  26. 26. Privately Held Social Media • Crunchbase (https://www.crunchbase.com/) is a good source of information on shareholders/owners, acquisitions, finances, etc. • Alexa web rankings (owned by Amazon) give an overview of usage levels and trends based on ranking relative to other sites in the US, and globally. • Social Media sites’ “business” and “press” sites, official blogs and news releases are best for user data. • Some social media provide advertising APIs – which may be usable for research depending on T&Cs and data content – but not developer or open APIs, e.g. Snapchat: https://www.snap.com/en-GB/news/post/third-party-applications-and-the-snapchat-api/ • Example: – Pinterest: • data on usage from Pinterest: https://business.pinterest.com/en • Alexa data on usage: https://www.alexa.com/siteinfo/pinterest.com • investor data: https://www.crunchbase.com/organization/pinterest/investors/investors_list
  27. 27. Data Quality & Reliability • Data sources and APIs can change regularly, and what is available may change over time (e.g. Twitter moved from all to “Top” tweets some years ago for its API; Facebook have changed data structures multiple times). • Errors in automated data collection can be hard to spot until analysis is undertaken – sampling, trial data collection, and review of code by colleagues can all be useful. • Gaps in data may occur because there are genuine gaps in data creation/posting etc; because there are technical issues with the social media service; because of an error in your code; or because you are over your API rate limit for the minute/hour/day. • Data may change over time – Facebook and Instagram allow posts to be edited so a request will capture one moment in time not necessarily the original or final versions. • Data may disappear over time. Notable example: the Twitter deletions terms and conditions means that deleted tweets will not appear in a later API call. – Research tools obeying the T&Cs will also update and remove deleted tweets. – Research tools retaining deleted tweets are technically in breach of the T&Cs. • Acquisitions, Mergers, and shut downs of social media sites can lead to changed terms and conditions, changes to data availability and use, changes or removals of APIs and data access routes, changes to user presence in a space, acceptable norms within a space (important for qualitative work particularly).
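Gaps of the kind listed above are easier to investigate if they are flagged as soon as data arrives. The sketch below looks for unexpectedly long silences between consecutive post timestamps; the timestamps and the one-hour threshold are invented, and a flagged gap only says that something needs checking, not which of the causes above applies.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs where consecutive timestamps are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


# Invented collection times for a busy hashtag on the day of the workshop.
collected = [
    datetime(2017, 10, 18, 9, 0),
    datetime(2017, 10, 18, 9, 5),
    datetime(2017, 10, 18, 11, 30),
    datetime(2017, 10, 18, 11, 40),
]

# Assumed threshold: for this hypothetical hashtag, an hour of silence is suspicious.
gaps = find_gaps(collected, timedelta(hours=1))
for start, end in gaps:
    print(f"No posts collected between {start} and {end}")
```

Comparing flagged gaps against your own collection logs and the platform's status pages can help separate a quiet period from a failed request or a rate limit.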
  28. 28. Hidden pre-filtering and sampling • Not all social media posts are equally likely to be included in standard API endpoints – e.g. a Twitter user with few posts and few followers is unlikely to appear on a popular hashtag. – The standard "Streaming" and "Search" APIs include 1% of Tweets and vary in accuracy depending on activity/time etc. (See Morstatter et al 2013). • Privacy settings will reduce the accuracy of any data sampled from Facebook or other networks with more complex privacy settings, but it is hard to see what is being excluded. Morstatter, F., Pfeffer, J., Liu, H. and Carley, K.M., 2013. Is the Sample good enough? Comparing data from Twitter's streaming API with Twitter's Firehose. In ICWSM 2013 and eprint arXiv:1306.5204. Available from: https://arxiv.org/abs/1306.5204
  29. 29. Question/Discussion Have you already tried obtaining data for the social media space you are using in your research? Have you faced any challenges or obstacles?
  30. 30. Existing Data Sets • “The Zuckerberg files”: digital archive of all public comments by Mark Zuckerberg including social media and mainstream media content for research use: https://www.zuckerbergfiles.org/ • FiveThirtyEight Data: archive of data associated with FiveThirtyEight articles, including social media data sets: https://github.com/fivethirtyeight/data • Lumen database – tracking legal notices and complaints for removal of online materials (including social media content): https://www.lumendatabase.org/ • CSIRO (Australia’s national science agency) We Feel – emotions in Tweets – API: http://wefeel.csiro.au/#/api (see: http://datadrivenjournalism.net/resources/we_feel) • Stanford Large Network Dataset Collection – includes social network data sets: https://snap.stanford.edu/data/ • Network Repository – network datasets, including social media, Facebook and Twitter networks: http://networkrepository.com/ • DocNow – social justice social network archives: http://www.docnow.io/
  31. 31. Cross-site data tools • North Carolina State University Social Media Archives Toolkit: https://www.lib.ncsu.edu/social-media-archives-toolkit; see also: https://github.com/NCSU-Libraries/Social-Media-Combine • Social Mention (search engine for social media) API: http://www.socialmention.com/api/ • Scrapebox (premium tool) YouTube Downloader: http://www.scrapebox.com/youtube-downloader and Social Account Scraper: http://www.scrapebox.com/social-account-scraper • ESRC COSMOS Open Data Tools (available but not updated since 2014): http://socialdatalab.net/software • Overview of Twitter data tools (Ahmed 2015): http://blogs.lse.ac.uk/impactofsocialsciences/2015/07/10/social-media-research-tools-overview/
  32. 32. Recommended: DMI Tools The Digital Methods Initiative add new (documented) tools all the time, including: • Censorship Explorer – determine censorship in various regions through URLs & proxies. • Disqus Comment Scraper – obtain data from the Disqus comment plugin. • Expand Tiny URLs – automatically expand large collections of Tiny URLs (e.g. from tweets). • Geo IP – translate URLs or IP addresses into geographic locations (e.g. for a blog). • Instagram Hashtag Explorer – retrieve Instagram media via specific hashtags. • Issue Crawler – uses URLs to analyse relationships and connections through links between URLs. • Netvizz (Facebook) – extracts data from Facebook around groups, pages, search. • Pinterest Scraper – scrapes Pinterest URLs and captures metadata of pins. • Tumblr – data capture based on Tumblr tags, which retrieves metadata and co-incident tags. • Twitter Capture and Analysis Toolset (DMI-TCAT) – robust and reproducible tool for data capture and analysis of Twitter data. Source code available for local use. • YouTube Data Tools – extract data on YouTube channels and videos, e.g. channel networks. Access documentation and DMI tools at: https://wiki.digitalmethods.net/Dmi/ToolDatabase See also, DMI Protocols: https://wiki.digitalmethods.net/Dmi/DmiProtocols
  33. 33. https://github.com/digitalmethodsinitiative/dmi-tcat/wiki
  34. 34. Internet Archive & WayBackMachine • Global archive capturing websites (to various levels of detail/depth) based on IA targets and user-submitted requests (since 2001). • You can request a site for archiving, or a group of sites. • Searchable resource OR can use exact URL to retrieve previous archived pages (WayBackMachine). • Collections exist for various social media collections, e.g: – 2016 US Presidential Election Social Media: https://archive.org/details/2016electiontwitter – Arab America on Social Media: https://archive.org/details/ArchiveIt-Collection-2797 – Gif Cities (Gifs from GeoCities): https://gifcities.org/ • Great for social media website changes, blogs, terms and conditions versions, etc. • Sites available in a range of archive formats (IA), or as viewable pages (WayBackMachine). • See: – https://archive.org/ – https://archive.org/web/
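Retrieving an archived page by exact URL can also be done programmatically. The sketch below assumes the Wayback Machine availability endpoint at `https://archive.org/wayback/available` and the response shape shown in the sample; the sample response itself is invented, and no network request is made here.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the Wayback Machine availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build a lookup URL for the snapshot of page_url closest to timestamp (YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the archived URL from a JSON response, or None if nothing is available."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Invented response, illustrating the assumed structure only.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20171018000000/https://twitter.com/tos",
            "timestamp": "20171018000000",
            "status": "200",
        }
    }
})

print(availability_url("twitter.com/tos", "20171018"))
print(closest_snapshot(sample_response))
```

This kind of lookup is one way to find the version of a platform's terms and conditions that applied when your data was collected.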
  35. 35. https://gifcities.org/?q=star+wars
  36. 36. UK Web Archive • Run by the British Library (since 2004). • Indexes (UK/related) sites to a greater depth than the Internet Archive. • Smaller archive. • You can request a site for archiving. • Special Collections include: – UK Blogs: https://www.webarchive.org.uk/ukwa/collection/100698/page/1/source/collection – London Terror Attacks, 2005 (mainstream and social media commentary): https://www.webarchive.org.uk/ukwa/collection/100757/page/1/source/collection – Olympic & Paralympic Games 2012 (mainstream and social media): https://www.webarchive.org.uk/ukwa/collection/4325386/page/1/source/collection • See: https://www.webarchive.org.uk/ukwa/.
  37. 37. Other Web Archive Resources • Rhizome: archiving for internet art, including interactive works engaging with/critiquing social media: http://rhizome.org/ • Note: EDINA are currently working on an archiving tool for researchers, ask me for more info on Site2Cite.
  38. 38. Using APIs to obtain Data • APIs (Application Programming Interfaces) exist for most social media sites and allow direct requests for data. • Some unofficial APIs exist for sites without official/open APIs. Use only with caution as these frequently have privacy, security or legal issues. • Consider working with text and data mining colleagues, or developers, to seek additional ways to capture data such as: – Screen scraping (automated capture of pages from a user perspective). – Mobile data collection or data capture approaches to social media. – Internet archiving approaches using standard tools or code libraries
  39. 39. Glossary: Data Request Terms • API: Application Programming Interface – a way to request data from a web service. • REST or RESTful API: REST stands for “Representational State Transfer” and means an API that uses HTTP (the protocol for accessing websites) requests (or “calls”) to: – GET – read access to content such as posts, users, etc. This is the main request you would use to retrieve data. – PUT – update or replace data. – POST – create new data (such as a post to a blog, a wiki page, etc.). – DELETE – Delete content. • An API Endpoint – is essentially the way to address and structure what kind of request you are making. E.g. home_timeline vs user_timeline. Each endpoint provides a different entry to the data behind a web service. • In a REST GET request you may have: – Fields – the various fields of data you want to retrieve, e.g. link, message, post, etc. These are usually shown in the Developer Documentation. – Modifiers or Parameters – these act like filters, limiting the request in a specific way, e.g. only retrieving posts with a location attached. – Operators – are the various standard terms/labels for content and content types that you can use in your GET request to shape and customise it, for instance this might include “retweets_of” or “bio” or “has:links” etc. • Other types of APIs and M2M (Machine-to-Machine) interfaces exist including “SOAP” and “RPC”. • SDK is Software Development Kit and is used increasingly often as a way to package various requests for developers to use in web or mobile apps (SDK has been used as a term for the coding tools for smartphone platforms iOS and Android for years).
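The four request types in the glossary can be illustrated without contacting any real service. This sketch uses Python's standard library to build (but not send) one request of each type; the `api.example.org` address and the message bodies are hypothetical.

```python
from urllib.request import Request

# Hypothetical web service; no requests are actually sent in this sketch.
BASE = "https://api.example.org/posts"


def build_request(method, url, body=None):
    """Create an HTTP request object with the given method and optional text body."""
    data = body.encode("utf-8") if body is not None else None
    return Request(url, data=data, method=method)


requests_by_action = {
    "read": build_request("GET", f"{BASE}/42"),
    "create": build_request("POST", BASE, '{"message": "hello"}'),
    "replace": build_request("PUT", f"{BASE}/42", '{"message": "edited"}'),
    "delete": build_request("DELETE", f"{BASE}/42"),
}

for action, request in requests_by_action.items():
    print(action, request.get_method(), request.full_url)
```

For data collection you will almost always be building GET requests; the other three change data on the service and are rarely appropriate in research.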
  40. 40. Locating or Requesting Social Media Data ProgrammableWeb (https://www.programmableweb.com/) is a great source for API information for social media sites: • Instagram Developer: https://www.instagram.com/developer/ – API Endpoints: https://www.instagram.com/developer/endpoints/ • Twitter Developer: https://developer.twitter.com/ – APIs: https://developer.twitter.com/en/docs – GNIP: http://support.gnip.com/apis/ – premium "Firehose" access. See also Twitter Enterprise: https://developer.twitter.com/en/enterprise – Free APIs cover 7 days of tweets; Premium APIs exist for 30-day search and full archive search. • Facebook for Developers: https://developers.facebook.com/ – API (Graph API): https://developers.facebook.com/docs/graph-api/ • YouTube Developers: https://developers.google.com/youtube/ – APIs (Comments and Comment Threads particularly useful): https://developers.google.com/youtube/v3/docs/ • Weibo API: http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en
  41. 41. How do you make an API call? • For open RESTful APIs you can enter an HTTP request in any browser window, e.g. http://services.groupkt.com/state/get/USA/all • Most social media APIs now require you to register your app, request a key, and include the resulting access tokens in your request. • In general API calls are made from within a small programme – this might be running on your machine or in a browser-based coding tool. • Many existing tools are built on social media APIs – see later slide for a sample of these. • Try it out: – Codecademy Twitter API tutorial: https://www.codecademy.com/en/tracks/twitter
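As a rough illustration of "a small programme" making an API call, the sketch below assembles an authenticated GET request using only Python's standard library. It builds the request but does not send it. The host api.example.org, the placeholder token, and the use of a "Bearer" Authorization header are assumptions for illustration only: each service documents its own authentication scheme, and some use a different one.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get_request(base_url, endpoint, params, access_token):
    """Assemble (but do not send) an authenticated GET request."""
    # Parameters are encoded into the query string after the endpoint.
    url = f"{base_url}/{endpoint}?{urlencode(params)}"
    # Assumption: the service accepts a bearer token in this header.
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(url, headers=headers, method="GET")


request = build_get_request(
    "https://api.example.org/1.1",        # hypothetical host
    "statuses/user_timeline.json",        # endpoint
    {"screen_name": "twitterapi", "count": 2},
    "YOUR-ACCESS-TOKEN",                  # placeholder, not a real token
)
print(request.get_method(), request.full_url)
```

To send the request for real you would pass it to urllib.request.urlopen(request), using the host and token issued to your registered app.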
  42. 42. An API Endpoint is a bit like a vending machine… Vending machine priced by grams of fat, Google, San Jose, California.jpg by Flickr user Cory Doctorow. You have to use the right machine to get hold of the item you want, then you have to enter the right code and the right price to get your candy. • Each item has a name, and a standard way to access it (in a vending machine this is the item code). • Each item has a value (in a vending machine this is the delicious edible contents of each item). • Each item requires some sort of trust exchange before you can access it (in a vending machine this is cash). • In an API that “E12” item code is actually going to look more like: https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=twitterapi&count=2 • In an API the price is usually a key/access token that is unique to you and your app – it indicates a legitimate request and who it’s from. Bonus: In APIs there is usually a huge range of data (research candy) to ask for, and lots of filtering options.
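To make the vending machine analogy concrete, the example URL on this slide can be split into the "machine" (host), the "item code" (endpoint path) and the filtering options (parameters). This is a minimal sketch using Python's standard library; it only splits the string and does not contact the service.

```python
from urllib.parse import parse_qs, urlsplit

url = ("https://api.twitter.com/1.1/statuses/user_timeline.json"
       "?screen_name=twitterapi&count=2")

parts = urlsplit(url)
host = parts.netloc        # which "machine" you are using
endpoint = parts.path      # the "item code" for the data you want
# parse_qs returns a list per parameter; keep the single value for each.
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(host)
print(endpoint)
print(params)
```

The access token (the "price") does not appear in this URL; it is normally supplied separately, in whatever way the service's authentication scheme requires.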
  43. 43. What will you get back from an API GET request? Assuming it has worked correctly, something like this… See the full example at: https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html Each of these is a new field for a single tweet and its value. [] is an empty field (e.g. no hashtag on this tweet).
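The response is typically JSON: a list of records, each a set of named fields and values. The sketch below reads a heavily trimmed response in Python. The sample record and its values are invented for illustration; the field names (created_at, id_str, text, entities.hashtags, user.screen_name) follow the Twitter documentation linked above as best understood here, so check them against the current documentation before relying on them.

```python
import json

# Invented, trimmed sample of what a user_timeline response might contain.
sample_response = """
[
  {
    "created_at": "Wed Oct 18 10:00:00 +0000 2017",
    "id_str": "1",
    "text": "An example tweet",
    "entities": {"hashtags": []},
    "user": {"screen_name": "example_user"}
  }
]
"""

tweets = json.loads(sample_response)

for tweet in tweets:
    # An empty list here is the "[]" case: no hashtags on this tweet.
    hashtags = [tag["text"] for tag in tweet["entities"]["hashtags"]]
    print(tweet["id_str"], tweet["user"]["screen_name"], tweet["text"], hashtags)
```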
  44. 44. This data can then be processed by your app, or simply retrieved and stored in a database or spreadsheet…
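One simple way to get retrieved records into a spreadsheet is to write them out as CSV. The sketch below flattens each record into a row and writes it with Python's csv module. The column choice and the sample record are assumptions for illustration; an in-memory buffer stands in for a real file.

```python
import csv
import io

COLUMNS = ["id_str", "created_at", "screen_name", "text"]  # illustrative choice


def to_row(tweet):
    """Flatten one nested tweet record into a single spreadsheet row."""
    return {
        "id_str": tweet["id_str"],
        "created_at": tweet["created_at"],
        "screen_name": tweet["user"]["screen_name"],
        "text": tweet["text"],
    }


def write_csv(tweets, out_file):
    """Write a header row followed by one row per tweet."""
    writer = csv.DictWriter(out_file, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(to_row(tweet) for tweet in tweets)


# Invented sample record.
tweets = [{
    "id_str": "1",
    "created_at": "Wed Oct 18 10:00:00 +0000 2017",
    "text": "An example tweet",
    "user": {"screen_name": "example_user"},
}]

buffer = io.StringIO()  # stands in for open("tweets.csv", "w", newline="")
write_csv(tweets, buffer)
print(buffer.getvalue())
```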
  45. 45. Recommended Tool: Martin Hawksey’s TAGS • Uses Google Docs to capture tweets based on a hashtag, search term, user, etc. • Can be automated to allow rolling capture. • Useful for capturing a sample of long term community dialogues or public discourse where Top Tweets/7 day limits will be acceptable. • Includes spreadsheet; visualisation; searchable archive – the latter two options are only available if you make data (semi) public. • Uses Twitter API – takes “Top” rather than “Latest” tweets so accuracy depends on popularity of content/hashtags. • Well documented and supported by Martin. • A great way to dip your toe in the API water – you have to obtain a key the first time you run TAGS, and can access and look at the code it runs. You can also make more advanced use of the tool and its automation by connecting it to other visualisation and analysis tools. • See: https://tags.hawksey.info/ • Support: https://tags.hawksey.info/forums/
  46. 46. Question/Discussion Do you have any experience or recommendations for social media data collection tools or approaches? Have you attended one of the Digital Scholarship sessions where CAHSS researchers can meet with developers and data specialists? [Recommended!]
  47. 47. Analysis & Visualisation Further information, tutorials etc. online and/or running through Digital Scholarship and Schools Research Methods training. • NVivo (http://www.qsrinternational.com/nvivo/) – Premium qualitative data analysis software with social media and multimedia support, collaborative working also supported. Feature rich. Training available. Available through UoE/CAHSS license: https://www.ed.ac.uk/information-services/computing/desktop-personal/software/main-software-deals/nvivo. • IBM SPSS (https://www.ibm.com/analytics/us/en/technology/spss/) – Premium data analysis tool for surveys and particularly for quantitative data, widely used in social sciences. Available through UoE license: https://www.ed.ac.uk/information-services/computing/desktop-personal/software/main-software-deals/spss. • Dedoose (http://www.dedoose.com/) – Premium qualitative data analysis software with simple interface, tagging, annotation and exploration options. • Chorus (http://chorusanalytics.co.uk/) – Free software for data harvesting and analytics for social science research using Twitter data. • Gephi (https://gephi.org/) – Visualisation and exploration of multiple data types, particularly good for network analysis. Feature rich so a bit of a learning curve. Free download. • D3 visualisation libraries (see: https://github.com/d3/d3/wiki/gallery) – Free collection of JavaScript libraries for use in data visualisation and exploration of multiple data types. • NodeXL (https://nodexl.codeplex.com/) – Network visualisation tool for Excel. Free. • TAGS Explorer (https://tags.hawksey.info/) – Twitter only visualisations of networks (using NodeXL) and searchable timeline archive explorations. Free. • Textal (http://www.textal.org/) – Text analysis tools for mobile use with Twitter streams, websites (inc. blogs), and documents. Free. • Tableau (https://www.tableau.com/) – Visualisation of multiple data sources and types. Free trial, otherwise monthly subscription.
A large quantity of open source tools and software are available. Search for these or look at the Journal of Open Research Software (https://openresearchsoftware.metajnl.com/) or the Journal of Open Source Software (http://joss.theoj.org/) for well documented research-driven examples. See also Tony Hirst’s OU Useful Blog (https://blog.ouseful.info/) for visualisation approaches. There are also many marketing packages for social media analysis which could be used/adapted for research where their processes are well documented.
  48. 48. Appropriate Handling & Storage • Data is usually returned with unique identifiers that can be easily traced back to the original poster/subject. • The unique identifiers connect conversations and posts so are hard to strip away entirely – although you could try a one-way hash of the data to mask the identifiable information but retain connections. • Short posts and tweets are highly identifiable. Try Googling or searching Twitter for a recent tweet to see that in action. • Images and videos can also be relatively easily compared/reverse image searched and therefore identifiable. • Think about which fields you actually need to retain for your research question(s). • Plan how long you will keep your data, and how you will keep it secure - where and how you store your data really matters.
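The one-way hash idea might look something like the sketch below, which replaces user identifiers with a keyed hash (HMAC-SHA256) so that the same user always maps to the same masked value and reply connections survive. The field names, sample posts and secret key are hypothetical. Using a secret key rather than a plain hash is a design choice made here, not something the slide specifies: it makes it harder to rebuild the mapping by hashing known IDs. This is pseudonymisation rather than anonymisation, since the post text itself can still be searched for.

```python
import hashlib
import hmac

# Hypothetical key: generate your own, store it securely and separately.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymise(identifier, key=SECRET_KEY):
    """Return a stable one-way pseudonym for an identifier."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_post(post):
    """Copy a post, replacing identifier fields with pseudonyms."""
    masked = dict(post)
    for field in ("user_id", "in_reply_to_user_id"):
        if masked.get(field) is not None:
            masked[field] = pseudonymise(masked[field])
    return masked


# Invented sample: the second post replies to the author of the first.
posts = [
    {"user_id": "1001", "in_reply_to_user_id": None, "text": "First post"},
    {"user_id": "1002", "in_reply_to_user_id": "1001", "text": "A reply"},
]

masked_posts = [mask_post(post) for post in posts]

# The reply still points at the same (now masked) author.
print(masked_posts[1]["in_reply_to_user_id"] == masked_posts[0]["user_id"])
```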
  49. 49. Data Protection & GDPR • Be aware of current Data Protection (Data Protection Act 1998) guidance on the use, storage and retention of personal data. • From 25th May 2018 the General Data Protection Regulation (GDPR) comes into effect with: – Increased rights for individuals: to understand how their data is used, to access, rectify and erase it, to restrict processing, to data portability, and to object to the use of their data. – Increased legal measures for organisations breaching GDPR guidance. • Ensure your Consent process, your Research Data Management plans, and your use, access and disposal of data are compliant. • By default social media APIs provide a lot of data: – What is the minimum data you require? – Removing unneeded data at the point of collection and/or data cleaning will help reduce any risks of exposure or non-compliance with data protection legislation. See: • Data Protection Act 1998: https://www.legislation.gov.uk/ukpga/1998/29/contents • ICO guidance: https://ico.org.uk/for-organisations/data-protection-reform/overview-of-the-gdpr/
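Removing unneeded data at the point of collection can be as simple as keeping a whitelist of fields and discarding everything else before a record is stored. The sketch below shows the idea; the field list and sample record are assumptions, and the right list depends entirely on your research question and consent process.

```python
# Assumption: these are the only fields this (hypothetical) study needs.
NEEDED_FIELDS = {"created_at", "text", "lang"}


def minimise(record, needed=NEEDED_FIELDS):
    """Return a copy of the record containing only the whitelisted fields."""
    return {name: value for name, value in record.items() if name in needed}


# Invented sample record with more fields than the study requires.
raw_record = {
    "created_at": "Wed Oct 18 10:00:00 +0000 2017",
    "text": "An example tweet",
    "lang": "en",
    "user": {"screen_name": "example_user", "location": "Edinburgh"},
    "coordinates": None,
}

stored_record = minimise(raw_record)
print(sorted(stored_record))
```

A whitelist is used rather than a list of fields to delete, so that any new fields a service starts returning are dropped by default.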
  50. 50. Local Support • Research Data Mantra – self-led course on Research Data Management, including appropriate handling, storage and planning for onward preservation, sharing or destruction: http://mantra.edina.ac.uk/ • Data Store – secure storage for active research data, available to all staff and PGR students: https://www.ed.ac.uk/information-services/research-support/research-data-service/working-with-data/data-storage • Working with Sensitive Data – guidance and further resources on working with sensitive and personal data: https://www.ed.ac.uk/information-services/research-support/research-data-service/working-with-data/sensitive-data • Information Security Team – guidance on legal and technical approaches to keeping data secure and appropriately encrypted and disposed of: https://www.ed.ac.uk/infosec
  51. 51. Making Research Data Open • If you have a consent process in place, ensure you request consent for any onward use you expect to make of your data. And ensure there is a process to withdraw consent for onward use. • Beware verbatim quoting in publications – it can be easy to search back to the original text. – Public figures who would consider their social media content a publication and part of their profile (e.g. politicians) are more appropriate to quote, where needed. – Even if anonymous/not attributed it is safer to paraphrase short comments where possible to make reverse searching more challenging. • Screenshots of posts often reveal the subject’s name, image, location, and contacts. Only use these where appropriate, properly consented to, and where you are not placing your subjects at risk. • Consider the time lag between data collection and any publication. Is your consent from participants still valid if a year has passed? What about 2 years? Or 5 years? A teen participant may feel differently about data being exposed when they are, for instance, a newly qualified lawyer or medic with very different reputational considerations. See also: University of North Carolina at Chapel Hill and UoE Research Data Management and Sharing (Coursera): https://www.coursera.org/learn/data-management
  52. 52. Courses and Information • DMI Digital Methods online course: https://wiki.digitalmethods.net/Digitalmethods/WebHome • UCL Why We Post: the Anthropology of Social Media course (FutureLearn): https://www.futurelearn.com/courses/anthropology-social-media • QUT Social Media Analytics: Using Data to Understand Public Conversations (FutureLearn): https://www.futurelearn.com/courses/social-media-analytics • Rutgers University Social Media Data Analytics (Coursera): https://www.coursera.org/learn/social-media-data-analytics • Doing Journalism with Data: First steps, skills and tools: http://learno.net/courses/doing-journalism-with-data-first-steps-skills-and-tools • UoE Digital Footprint MOOC – understand some of the challenging identity, privacy and ethical concerns around social media for you and your research subjects: https://www.coursera.org/learn/digital-footprint/
  53. 53. Useful Niche Resources • Utrecht Data School Data Ethics Decision Aid (DEDA): https://dataschool.nl/research/deda/?lang=en • Programming Historian Data Mining the Internet Archive lesson: https://programminghistorian.org/lessons/data-mining-the-internet-archive • Insight News Lab Social Network Analysis and Visualisation for #RDAPlenary 3 (using ScraperWiki and OpenRefine): http://hujo.deri.ie/rdaplenarysn/ • Tony Hirst First Baby Steps to Anonymising Data with Open Refine: https://blog.ouseful.info/2015/01/23/anonymising-data-with-open-refine/ • Tony Hirst Social Interest Positioning – Visualising Facebook Friends’ Likes with Data Grabbed Using Google Refine: https://blog.ouseful.info/2012/01/04/social-interest-positioning-visualising-facebook-friends-likes/ • Tony Hirst Grabbing Twitter Search Results into Google Refine and Exporting Conversations into Gephi (needs updating for new Twitter API): https://blog.ouseful.info/2012/10/02/grabbing-twitter-search-results-into-google-refine-and-exporting-conversations-into-gephi/
  54. 54. Local research and expertise (a small sampling thereof!) • Social media, Digital Ethnography, Sociological research methods – Kate Orton-Johnson (Sociology). • Social Media, Digital Labour – Karen Gregory (Sociology). • Communities on the Darknet, illicit markets and cultures – Angus Bancroft (Sociology). • Social media in education; bots; anonymity in social media – Sian Bayne (Research in Digital Education Centre, Moray House). • Digital cultural heritage learning and engagement – Jen Ross (Research in Digital Education Centre, Moray House); Claire Sowton (CAHSS); Melissa Terras (UCL/CAHSS). • Text and data mining of social media content – Claire Grover (Informatics); Richard Tobin (Informatics); Clare Llewellyn (Informatics; Neuropolitics Research, SPS). • Sharing of photography, autobiographical memory and distributed cognition (inc. social media) – Tim Fawns (Clinical Education, Centre for Medical Education, MVM). • Big data (inc. social media) in healthcare – Mhairi Aitken (Usher Institute, MVM). • Social media, Digital Footprint, blogging and Buddhism – Louise Connelly (Vet School). • Mobility, mobile technology, formal and informal education communities around the world – Michael Sean Gallagher (Research in Digital Education Centre, Moray House). • Playful learning in informal digital environments – Clara O’Shea (Research in Digital Education Centre, Moray House). • Social media and politics – Neuropolitics Research group: Laura Cram (Politics, SPS); Robin Hill (Informatics; SPS); Sujin Hong (SPS); Adam Moore (PPL). • Visualisation of big data, including network analysis – Benjamin Bach (Design Informatics). • Social Media and scholarly communities – Sara Shinton (IAD); James Stewart (SPS).
  55. 55. Recommended work & groups researching in this area • UoE Beyond Text Network – interdisciplinary network for social media and multimedia researchers: https://www.wiki.ed.ac.uk/display/DIG/Beyond+Text • UoE Informatics Language Technology Group – text mining expertise working on projects including topic modelling and social media analysis: https://www.ltg.ed.ac.uk/ • Digital Methods Initiative (DMI) (European multi-organisation research group): https://wiki.digitalmethods.net/Dmi/DmiAbout • Microsoft Research Social Media Collective (US) – particularly danah boyd, Nancy Baym and Kate Crawford’s work: https://www.microsoft.com/en-us/research/group/social-media-collective/ • #NSMNSS: New social media, new social science? – great blog reflecting on social science methods around social media: http://nsmnss.blogspot.co.uk/ • Oxford Internet Institute – particularly strong on relationships to mainstream media environment: https://www.oii.ox.ac.uk/ • Visual Social Media Lab (Sheffield) – led by Farida Vis: http://visualsocialmedialab.org/ • DocNow – social justice social media archiving: http://www.docnow.io/ • Data Driven Journalism (European Journalism Centre and Netherlands): http://datadrivenjournalism.net/ • Analysing Social Media Collaboration (UK cross-institution group, site now dormant) – responsible for the high profile “Reading the Riots” Twitter analysis work in 2011: http://www.analysingsocialmedia.org/home • Michael Zimmer – influential work on privacy, leading projects on privacy and Facebook: http://www.michaelzimmer.org/ • Electronic Frontier Foundation – advocates with expertise on privacy and tracking in social media: https://www.eff.org/ • Centre for Social Media Research (University of Westminster): https://www.westminster.ac.uk/social-media-research • Digital Media and Society Research Group (Cardiff): https://www.cardiff.ac.uk/research/explore/research-units/digital-media-and-society • COSMOS (legacy page for Cardiff research group): http://www.cs.cf.ac.uk/cosmos/
  56. 56. Recommended Journals • First Monday (University of Illinois at Chicago): http://firstmonday.org/index • New Media & Society (Sage): http://journals.sagepub.com/home/nms • Information, Communication & Society (Taylor & Francis): http://tandfonline.com/toc/rics20/current • Social Media + Society (Sage): http://journals.sagepub.com/home/sms • Big Data & Society (Sage): http://journals.sagepub.com/home/bds • Policy & Internet (Wiley): http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1944-2866 • Journal of Computer-Mediated Communication (Wiley): http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1083-6101 • Cyberpsychology, Behavior, and Social Networking (Mary Ann Liebert Inc.): http://online.liebertpub.com/loi/CYBER • Journal of Broadcasting & Electronic Media (Taylor & Francis): http://www.tandfonline.com/toc/hbem20/current
  57. 57. Relevant Upcoming Digital Scholarship Sessions • Digital Research Clinics and Resources (26th October 2017) • Cleaning Data with Open Refine (1st November 2017) • Regex: Regular Expressions (23rd November 2017) • Introduction to Sentiment Analysis: What it is and how to do it simply (14th December 2017) Look out for further sessions and/or contact the team with any specific requests: http://www.digital.cahss.ed.ac.uk/
  58. 58. Questions & Discussion Or follow up after today: nicola.osborne@ed.ac.uk
