Talk held at the Royal Statistical Society in London as part of the event series "Blurring the boundaries - New social media, new social science?". I thank Grant Blank from the OII for inviting me to this exciting workshop.
Collecting Twitter data
Dr. Cornelius Puschmann
School of Library and Information Science, Humboldt-University of Berlin / Humboldt Institute for Internet and Society
16 April 2013, Royal Statistical Society
Overview
1. Examples of research using Twitter data
2. Twitter's data infrastructure
3. Tools for collecting data
4. Sampling issues
Examples of research using Twitter data
• Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a social network or a news media? Proceedings of the 19th International Conference on the World Wide Web (WWW ’10) (pp. 591–600). Raleigh, NC.
• González-Bailón, S., Borge-Holthoefer, J., Rivero, A., & Moreno, Y. (2011). The dynamics of protest recruitment through an online network. Scientific Reports, 1, 197. doi:10.1038/srep00197
• Ausserhofer, J., & Maireder, A. (2013). National politics on Twitter: Structures and topics of a networked public sphere. Information, Communication & Society, 16(3), 291–314. doi:10.1080/1369118X.2012.756050
• Papacharissi, Z., & De Fatima Oliveira, M. (2012). Affective news and networked publics: The rhythms of news storytelling on #Egypt. Journal of Communication, 62(2), 266–282. doi:10.1111/j.1460-2466.2012.01630.x
Example questions
Twitter as a platform
• How can Twitter's structure be described?
Social graph
• Who follows whom?
• How does information spread?
Hashtags, keywords, and geography
• How can the discussion of topic X be characterized?
• Who is participating in discussions on X?
• Where are users discussing X?
Example questions
URLs in Twitter
• How is mass media content discussed?
• How are academic papers cited on Twitter?
Creative approaches
• Where, when, and with what devices do people call taxis?
Prediction/application
• Can election results/flu outbreaks/consumption patterns be reliably predicted?
Three Twitter APIs
REST API
• traditionally used by most client software
• rate-limited
• v1.0 will be phased out in May 2013
• to be replaced by more restrictive v1.1
Streaming API
• public, user, and site streams
• provides data in real time and largely unprocessed as it flows through the platform
Search API
• same functionality as Twitter search
1) data: tweets, social graph
2) complex tools needed
3) constraints on how much data can be captured
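As a rough illustration of what "largely unprocessed" data looks like on the receiving end, the sketch below parses newline-delimited JSON messages of the kind the Streaming API delivers. It makes no network connection. The helper name iter_tweets and the sample lines are invented for illustration, and the field names (id_str, text, user.screen_name) are assumed to follow Twitter's tweet JSON; they should be checked against the current API documentation.

```python
import json


def iter_tweets(lines):
    """Yield tweet dicts from an iterable of raw stream lines.

    Hypothetical helper: blank keep-alive lines and messages without a
    "text" field (for example delete notices) are skipped.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # keep-alive newline, nothing to parse
        message = json.loads(raw)
        if "text" in message:
            yield message


# Invented sample data standing in for a live stream.
sample_stream = [
    '{"id_str": "1", "text": "Hello #rss2013", "user": {"screen_name": "alice"}}',
    "",
    '{"delete": {"status": {"id_str": "0"}}}',
    '{"id_str": "2", "text": "Collecting data", "user": {"screen_name": "bob"}}',
]

tweets = list(iter_tweets(sample_stream))
for tweet in tweets:
    print(tweet["id_str"], tweet["user"]["screen_name"], tweet["text"])
```

In a real collection setup the lines would come from a long-lived HTTP connection rather than a list, which is one reason dedicated tools are usually needed.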
Legal issues: Twitter's terms of service
"By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed)."
"You agree that this license includes the right for Twitter to make such Content available to other companies, organizations or individuals who partner with Twitter for the syndication, broadcast, distribution or publication of such Content on other media and services, subject to our terms and conditions for such Content use."
"We encourage and permit broad re-use of Content. The Twitter API exists to enable this."
Legal issues: API rules
"You will not attempt or encourage others to: sell, rent, lease, sublicense, redistribute, or syndicate access to the Twitter API or Twitter Content to any third party without prior written approval from Twitter. If you provide an API that returns Twitter data, you may only return IDs (including tweet IDs and user IDs). You may export or extract non-programmatic, GUI-driven Twitter Content as a PDF or spreadsheet by using "save as" or similar functionality. Exporting Twitter Content to a datastore as a service or other cloud based service, however, is not permitted."
"Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services."
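The "you may only return IDs" rule quoted above suggests one way of preparing a collected dataset for sharing: reduce it to tweet IDs and user IDs. The following is a minimal sketch of that idea, not a statement of what the rules require in a given project. The function name to_shareable_ids and the sample records are invented, and the id_str fields are assumed to follow Twitter's tweet JSON.

```python
def to_shareable_ids(tweets):
    """Reduce collected tweets to tweet IDs and user IDs only.

    Hypothetical helper; all other fields (text, screen name, ...) are dropped.
    """
    return [
        {"tweet_id": tweet["id_str"], "user_id": tweet["user"]["id_str"]}
        for tweet in tweets
    ]


# Invented sample records standing in for collected data.
collected = [
    {"id_str": "101", "text": "first tweet",
     "user": {"id_str": "7", "screen_name": "alice"}},
    {"id_str": "102", "text": "second tweet",
     "user": {"id_str": "8", "screen_name": "bob"}},
]

shareable = to_shareable_ids(collected)
print(shareable)
```

Recipients of such a list would then have to retrieve the full tweets themselves through the API.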
Sampling approaches
Strategy #1: Sample by hashtag, keyword, user, geographical location, or other filtering parameters
+ representativeness unclear on multiple levels
- time frame and parameters have to be carefully chosen
Strategy #2: Use the 1% or 10% sample provided by the Streaming API
+ generally assumed to be representative (of Twitter)
- time frame has to be carefully chosen
Strategy #3: Capture Twitter's entire throughput
+ highly representative (of Twitter)
- technically very difficult/costly
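Strategy #1 can be sketched as a filter over already collected tweets, which makes explicit that both the hashtag and the time frame are choices of the researcher. The record layout below (a hashtags list and a datetime in created_at) is a simplified assumption rather than Twitter's raw format, and the function name, the hashtag and the sample records are invented for illustration.

```python
from datetime import datetime


def sample_by_hashtag(tweets, hashtag, start, end):
    """Select tweets that carry a hashtag and fall in [start, end).

    Hypothetical helper; hashtag matching is case-insensitive.
    """
    wanted = hashtag.lower().lstrip("#")
    selected = []
    for tweet in tweets:
        tags = {tag.lower() for tag in tweet["hashtags"]}
        if wanted in tags and start <= tweet["created_at"] < end:
            selected.append(tweet)
    return selected


# Invented sample records in a simplified layout.
tweets = [
    {"id": "1", "created_at": datetime(2013, 4, 16, 10, 0), "hashtags": ["RSS2013"]},
    {"id": "2", "created_at": datetime(2013, 4, 16, 11, 0), "hashtags": ["other"]},
    {"id": "3", "created_at": datetime(2013, 4, 18, 9, 0), "hashtags": ["rss2013"]},
]

selected = sample_by_hashtag(
    tweets, "#rss2013", datetime(2013, 4, 16), datetime(2013, 4, 17)
)
print([tweet["id"] for tweet in selected])
```

Shifting the time window by a day or adding a second hashtag changes which tweets are selected, which is why these parameters have to be chosen and documented carefully.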
Summary
1. develop a question/general direction
2. collect data using these or other tools
3. store in a database or spreadsheet (CSV)
4. annotate, analyze and visualize using a variety of tools (Excel, Tableau, R, Gephi, NVivo, ...)
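The storage step can be illustrated with Python's standard csv module. This is only a sketch: the column names and the sample rows are assumptions chosen for illustration, and the data is written to an in-memory buffer so that the example runs without touching the file system.

```python
import csv
import io

# Assumed column layout; adjust to the fields actually collected.
FIELDS = ["id", "user", "created_at", "text"]


def tweets_to_csv(tweets, handle):
    """Write tweet records (dicts with the FIELDS keys) as CSV to handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet)


# Invented sample rows; the comma in the first text shows automatic quoting.
rows_in = [
    {"id": "1", "user": "alice", "created_at": "2013-04-16 10:00",
     "text": "Hello, London"},
    {"id": "2", "user": "bob", "created_at": "2013-04-16 11:00",
     "text": "Collecting data"},
]

buffer = io.StringIO()
tweets_to_csv(rows_in, buffer)
print(buffer.getvalue())
```

To write a real file, the buffer could be replaced by open("tweets.csv", "w", newline="", encoding="utf-8"); the resulting file can then be opened in Excel, Tableau or R.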