Twitter analytics: some thoughts on sampling, tools, data, ethics and user requirements

Keynote delivered at the SRA Social Media in Social Research conference, London, 24 June, 2013. The presentation highlights some thoughts on sampling, tools, data, ethics and user requirements for Twitter analytics, including an overview of a series of recent tools.

Published in: Technology, Business

  1. 1. Twitter analytics: some thoughts on sampling, tools, data, ethics and user requirements Farida Vis, Information School University of Sheffield @flygirltwo Keynote SRA Social Media in Social Research conference, London, 24 June 2013.
  2. 2. READING THE RIOTS ON TWITTER Rob Procter (University of Manchester) Farida Vis (University of Leicester) Alexander Voss (University of St Andrews) [Funded by JISC] #readingtheriots
  3. 3. What role did social media play? 2.6 million riot tweets (donated by Twitter) – 700,000 individual accounts Initially: o Role of Rumours o Did incitement take place? [no - #riotcleanup] o What is the role of different actors on Twitter?
  4. 4. Role of Rumours
  5. 5. Guardian Interactive Team (Alastair Dant) http://www.guardian.co.uk/uk/interactive/2011/dec/07/london-riots-twitter Data Journalism Award (sponsored by Google)
  6. 6. • Lots of questions about methods • Lots of questions about our tools • Lots of questions about donated data • Lots of questions about ethics
  7. 7. Actively engaged on Twitter
  8. 8. Actor Types – top 1000 mentions Typical long tail distribution Twitter researchers tend to focus on the head
  9. 9. Actor Types Mainstream Media Police/emergency services Only online media (news) Riot accounts Non-(news) mainstream media Celebrities Journalists (mainstream media) Researchers Journalists (online media) Members of the public Non-(news) media organisations Bots Bloggers Unclear Activists Account closed down UK Twitterati Fake/spoof account Political Actors Other http://researchingsocialmedia.org/2012/01/24/reading-the-riots-on-twitter-who-tweeted-the-riots/
  10. 10. Who tweeted the riots? - categories mainstream media journalists riot accounts
  11. 11. You know you’re dealing with Twitter data when… Number 13, 6697 mentions Number 20, 5939 mentions Number 23, 5527 mentions
  12. 12. Context Context Context
  13. 13. Individual accounts with > 3K mentions
  14. 14. 30031 mentions, 441 tweets sent over 4 days: top UK listed journalist (2) 3484 mentions, 290 tweets sent over 4 days: top non UK listed journalist (34)
  15. 15. Image sharing practices during crises
  16. 16. 400 million tweets/day (March 2013) 40 million Instagram images/day (January 2013) Percentages posted to Twitter / Facebook -> 59% posted to Twitter -> 98% posted to Facebook
  17. 17. Where do images fit in the era of ‘Big Data’?
  18. 18. Big Data – text + number driven Images: undervalued, underexplored Not by the users
  19. 19. Deleted content http://twitpic.com/62m6nx
  20. 20. #FakeSandy pics 250,000 tweets (4hrs) 1 weekend http://istwitterwrong.tumblr.com/ Jean Burgess Farida Vis Axel Bruns
  21. 21. ‘fakes’ http://www.guardian.co.uk/news/datablog/2012/nov/06/fake-sandy-pictures-social-media
  22. 22. Twitter handles MPSBarkDag MPSBarnet MPSBexley MPSBrent MPSBromley MPSCamden metpoliceuk MPSWestminster MPSCroydon EalingMPS MPSEnfield MPSGreenwich MPSHackney MPSHammFul MPSHaringey MPSHarrow MPSHavering MPSHillingdon MPSHounslow MPSIslington MPSKenChel MPSKingston LambethMPS MPSLewisham MPSMerton MPSNewham MPSRedbridge MPSRichmond MPSSouthwark MPSSutton MPSTowerHam MPSWForest MPSWandsworth Plus: @MetPoliceEvents (Updates from the Met Police regarding demonstrations & events in London) @MPSOnTheStreet (An official MPS account giving an officer on the ground's view of events, operations and other policing activities in London) @MPSDoI (Updates from the Metropolitan Police Service, Directorate of Information) Police tweets
  23. 23. Collecting the data Scraper by Jacopo Ottaviani URL for the scraper: https://scraperwiki.com/scrapers/police_and_the_olympics_2012/ ScraperWiki is a key DDJ site
  24. 24. Datajournalismhandbook.org Reference point 1
  25. 25. Data challenges • Collecting Twitter data in (real) time (APIs) • Methods for building a reliable corpus • Problems with language bias • Problems with hashtag/keyword bias • API bias • Demographics of Twitter users – who are they? • Problems with escalating volume • Mapping explosion of new tools: are they any good? • Off the shelf tools (growing divide in research capacity in this area) • Limitations of the tools • Problems with data sharing / replicating studies + findings
  26. 26. Data challenge 1: Know your API
  27. 27. See: https://dev.twitter.com/start
  28. 28. 1% random sample of the firehose If not rate limited – all data may be collected
  29. 29. FIREHOSE
  30. 30. Data challenge 2: API bias?
  31. 31. We collect and analyse messages exchanged in Twitter using two of the platform's publicly available APIs (the search and stream specifications). We assess the differences between the two samples, and compare the networks of communication reconstructed from them. The empirical context is given by political protests taking place in May 2012: we track online communication around these protests for the period of one month, and reconstruct the network of mentions and retweets according to the two samples. We find that the search API over-represents the more central users and does not offer an accurate picture of peripheral activity; we also find that the bias is greater for the network of mentions. We discuss the implications of this bias for the study of diffusion dynamics and collective action in the digital era, and advocate the need for more uniform sampling procedures in the study of online communication. (González-Bailón et al, 2012)
  32. 32. Data challenge 3: rate limiting + 1%
  33. 33. Random sampling with the streaming API: the 1% ‘If we estimate a daily tweet volume of 450 million tweets (Farber), this would mean that, in terms of standard sampling theory, the 1% endpoint would provide a representative and high resolution sample with a maximum margin of error of 0.06 at a confidence level of 99%, making the study of even relatively small subpopulations within that sample a realistic option.’ (Gerlitz and Rieder, 2013)
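The margin of error quoted on slide 33 can be roughly reproduced with the standard formula for a sampled proportion. This is a minimal sketch, not the calculation Gerlitz and Rieder report using: it assumes the worst-case proportion p = 0.5, a z-value of 2.576 for 99% confidence, a finite population correction, and that the quoted 0.06 is meant in percentage points. The function name `margin_of_error` is made up for illustration.

```python
import math


def margin_of_error(population, sample, z=2.576, p=0.5):
    """Margin of error for a proportion estimated from a simple random sample.

    z=2.576 corresponds to a 99% confidence level; p=0.5 is the worst case.
    The last factor is the finite population correction.
    """
    standard_error = math.sqrt(p * (1 - p) / sample)
    correction = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * correction


# Figures from the slide: 450 million tweets per day, 1% sample.
daily_tweets = 450_000_000
sample_size = daily_tweets // 100

moe = margin_of_error(daily_tweets, sample_size)
print(f"Margin of error: {moe * 100:.3f} percentage points")
```

Under these assumptions the result is roughly 0.06 percentage points. The formula only holds if the 1% endpoint really is a simple random sample of all tweets, which is the question the next slides raise.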
  34. 34. Data challenge 4: relation to firehose?
  35. 35. ‘The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter’s sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet.’ (Morstatter et al, 2013)
  36. 36. Data challenge 5: relation to ‘general public’?
  37. 37. Data challenge 6: what data to collect?
  38. 38. For hashtag datasets: contributions made by specific users and groups of users; overall patterns of activity over time; combinations to examine contributions by specific users and groups over time. (Bruns and Stieglitz, 2013)
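The three kinds of metric listed on slide 38 can be sketched with simple counts. This is only an illustration of the idea, not the metrics as defined by Bruns and Stieglitz: the tweet records, the field names `user` and `created_at`, and the function names are all invented for the example.

```python
from collections import Counter
from datetime import datetime

# Invented records standing in for a hashtag dataset.
tweets = [
    {"user": "alice", "created_at": "2011-08-08T21:05:00"},
    {"user": "bob", "created_at": "2011-08-08T21:40:00"},
    {"user": "alice", "created_at": "2011-08-08T22:10:00"},
    {"user": "alice", "created_at": "2011-08-08T22:30:00"},
]


def hour_of(tweet):
    """Round a tweet's timestamp down to the hour it was sent in."""
    return datetime.fromisoformat(tweet["created_at"]).strftime("%Y-%m-%d %H:00")


def tweets_per_user(records):
    """Contributions made by specific users."""
    return Counter(t["user"] for t in records)


def tweets_per_hour(records):
    """Overall pattern of activity over time."""
    return Counter(hour_of(t) for t in records)


def tweets_per_user_per_hour(records):
    """Combination: contributions by specific users over time."""
    return Counter((t["user"], hour_of(t)) for t in records)


print(tweets_per_user(tweets).most_common())
print(sorted(tweets_per_hour(tweets).items()))
```

Grouping users into categories (for example the actor types from the riots study) would only need a lookup from user name to group before counting.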
  39. 39. Data challenge 6: how to collect the data?
  40. 40. TWITTER TOOLS
  41. 41. Recent explosion in Twitter tools • Twitonomy • ScraperWiki • TAGS • DMI Twitter Capture and Analysis Toolset • MozDeh (and Webometric Analyst) • NVivo 10 • YourTwapperKeeper
  42. 42. Twitonomy (REST + search API)
  43. 43. Scraperwiki
  44. 44. #horsemeat still producing data in June!
  45. 45. Tweet mapping: geolocations
  46. 46. TAGS
  47. 47. Collects up to 8000 tweets based on hashtags/keywords/users
  48. 48. DMI Twitter Capture and Analysis Toolset
  49. 49. DMI tools for extracting links (all the URLs). Most URLs are shortened, mainly using t.co (Twitter). Unpack them using: Didn’t always work; manual unpacking and note taking were needed (plus you still have the shortened URL in case you want to retrace it).
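The unpacking step on slide 49 can be sketched as follows. This is not the tool used in the talk: it is an assumed approach built on Python's standard `urllib`, which follows redirects, and the function names `resolve_with_urllib` and `expand_urls` are made up. The shortened URL is kept next to the expanded one so it can be retraced, and anything that fails is flagged for manual unpacking.

```python
import urllib.request


def resolve_with_urllib(short_url, timeout=10):
    """Follow redirects and return the final URL (needs network access)."""
    request = urllib.request.Request(short_url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def expand_urls(short_urls, resolve=resolve_with_urllib):
    """Expand each shortened URL, keeping the original alongside the result.

    `resolve` can be swapped for another resolver, e.g. for offline testing.
    """
    rows = []
    for short_url in short_urls:
        try:
            expanded = resolve(short_url)
        except (OSError, ValueError):
            # Dead link, timeout or malformed URL: leave for manual checking.
            expanded = None
        rows.append(
            {
                "short": short_url,
                "expanded": expanded,
                "needs_manual_check": expanded is None,
            }
        )
    return rows
```

Calling `expand_urls` on a list of t.co links returns one row per link. Some servers reject HEAD requests, so a failed row does not necessarily mean the link is dead.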
  50. 50. MOZDEH (and Webometric Analyst)
  51. 51. NVivo 10
  52. 52. YourTwapperKeeper
  53. 53. Data challenge 7: how to analyse the data?
  54. 54. What to do about all those bots?
  55. 55. For hashtag datasets: contributions made by specific users and groups of users; overall patterns of activity over time; combinations to examine contributions by specific users and groups over time. (Bruns and Stieglitz, 2013)
  56. 56. Data collected + methods used produce specific research object
  57. 57. Where do images fit in the era of ‘Big Data’?
  58. 58. Data challenge 8: representing your data?
  59. 59. Data visualisations: what are they and what do they want?
  60. 60. Data challenge 9: how to deal with ethics?
  61. 61. Data challenge 10: user requirements?
  62. 62. What do we want from these APIs, the data, the tools, and Twitter researchers so that we can develop more robust social scientific research on Twitter?
  63. 63. @flygirltwo
  64. 64. References • Bruns, A., and Stieglitz, S. 2013. Towards More Systematic Twitter Analysis: Metrics for Tweeting Activities. International Journal of Social Research Methodology. DOI:10.1080/13645579.2013.770300 Available from: http://snurb.info/files/2013/Towards%20More%20Systematic%20Twitter%20Analysis%20(final).pdf • Gerlitz, C., and Rieder, B. 2013. Mining One Percent of Twitter: Collections, Baselines, Sampling. M/C Journal, Vol. 16, No 2. Available from: http://journal.media-culture.org.au/index.php/mcjournal/article/viewArticle/620 • González-Bailón, S., Ning, W., Rivero, A., Borge-Holthoefer, J., and Moreno, Y. 2012. Assessing the Bias in Communication Networks Samples from Twitter. Available from: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2185134 • Morstatter, F., Pfeffer, J., Liu, H., and Carley, K.M. 2013. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. Association for the Advancement of Artificial Intelligence. Available from: http://www.public.asu.edu/~fmorstat/paperpdfs/icwsm2013.pdf • Vis, F. 2012. Twitter as a reporting tool for breaking news: journalists tweeting the 2011 UK riots, Digital Journalism 1(1). Available from: http://www.tandfonline.com/doi/full/10.1080/21670811.2012.741316#.UcwBZ-CPDao • Vis, F., Faulkner, S., Parry, K., Manyukhina, Y., and Evans, L. (in press), Twitpic-ing the riots: analysing images shared on Twitter during the 2011 UK riots, in Twitter and Society, Weller, K., Bruns, A., Burgess, J., Mahrt, M., and Puschmann, C. (eds.), New York: Peter Lang.
  65. 65. Links to all mentioned tools • Twitonomy - http://www.twitonomy.com/ • ScraperWiki - https://beta.scraperwiki.com/ • TAGS - http://mashe.hawksey.info/2013/02/twitter-archive-tagsv5/ • DMI Twitter Capture and Analysis Toolset - https://wiki.digitalmethods.net/Dmi/ToolDmiTcat • MozDeh (and Webometric Analyst) - http://mozdeh.wlv.ac.uk/ + http://lexiurl.wlv.ac.uk/ • NVivo 10 - http://www.qsrinternational.com/products_nvivo.aspx • YourTwapperKeeper - https://github.com/540co/yourTwapperKeeper See also: http://mappingonlinepublics.net/tag/yourtwapperkeeper/
