Collecting Twitter data
           Dr. Cornelius Puschmann
   School of Library and Information Science
       Humboldt-University of Berlin /
   Humboldt Institute for Internet and Society
                 16 April 2013
            Royal Statistical Society
Overview
1. Examples of research using Twitter data
2. Twitter's data infrastructure
3. Tools for collecting data
4. Sampling issues
Examples of research using Twitter data
•   Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social
    Network or a News Media? Proceedings of the 19th International
    Conference on the World Wide Web (WWW ’10) (pp. 591–600). Raleigh, NC.

•   González-Bailón, S., Borge-Holthoefer, J., Rivero, A., & Moreno, Y. (2011). The
    dynamics of protest recruitment through an online network. Scientific
    Reports, 1, 197. doi:10.1038/srep00197

•   Ausserhofer, J., & Maireder, A. (2013). National politics on Twitter: Structures
    and topics of a networked public sphere. Information, Communication &
    Society, 16(3), 291–314. doi:10.1080/1369118X.2012.756050

•   Papacharissi, Z., & De Fatima Oliveira, M. (2012). Affective News and
    Networked Publics: The Rhythms of News Storytelling on #Egypt. Journal of
    Communication, 62(2), 266–282. doi:10.1111/j.1460-2466.2012.01630.x
Example questions
Twitter as a platform
• How can Twitter's structure be described?
Social graph
• Who follows whom?
• How does information spread?
Hashtags, keywords, and geography
• How can the discussion of topic X be characterized?
• Who is participating in discussions on X?
• Where are users discussing X?
Example questions
URLs in Twitter
• How is mass media content discussed?
• How are academic papers cited on Twitter?
Creative approaches
• Where, when, and with what devices do people
  call taxis?

Prediction/application
• Can election results/flu outbreaks/consumption
  patterns be reliably predicted?
#phdchat data set (30k tweets)
visualization of keywords using Gephi
Extracting Twitter data
HTTP request: return all data from a given user/hashtag/geolocation/...
Application Programming Interface (API)
Data (usually in a database or spreadsheet)
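The first step in this chain, the HTTP request, can be sketched in a few lines of Python. This is only an illustration: the endpoint `https://api.example.com/search` and the parameter names `q`, `count`, and `geocode` are assumptions chosen for the example, not a documented Twitter URL, and a real request would also need authentication.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration; not a real Twitter URL.
BASE_URL = "https://api.example.com/search"


def build_request_url(query, count=100, geocode=None):
    """Return the URL of an HTTP GET request asking for matching tweets."""
    params = {"q": query, "count": count}
    if geocode is not None:
        # Assumed format: "latitude,longitude,radius"
        params["geocode"] = geocode
    # urlencode percent-escapes characters such as '#' and ','
    return BASE_URL + "?" + urlencode(params)


url = build_request_url("#phdchat", count=50)
print(url)
```

The hashtag sign is escaped as `%23` so that it is sent as part of the query rather than being read as a URL fragment.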
Tweet in browser
Tweet source via API
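The tweet source returned by an API is typically a nested JSON document rather than the rendered text seen in the browser. The sketch below flattens one such document into a single record. The sample tweet is invented for illustration, and the field names (`id_str`, `created_at`, `text`, `user.screen_name`, `entities.hashtags`) are assumptions about the response layout that should be checked against the API documentation in use.

```python
import json

# Invented sample document; field names are assumed, not taken from the slides.
SAMPLE_TWEET = """
{"id_str": "1",
 "created_at": "Tue Apr 16 10:00:00 +0000 2013",
 "text": "Collecting data for #phdchat",
 "user": {"screen_name": "example_user"},
 "entities": {"hashtags": [{"text": "phdchat"}]}}
"""


def flatten_tweet(raw_json):
    """Turn a nested tweet document into one flat record (one spreadsheet row)."""
    tweet = json.loads(raw_json)
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return {
        "id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "user": tweet["user"]["screen_name"],
        "text": tweet["text"],
        # Join multiple hashtags into one cell so the record stays flat
        "hashtags": ";".join(h["text"] for h in hashtags),
    }


record = flatten_tweet(SAMPLE_TWEET)
print(record)
```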
Three Twitter APIs
REST API
• traditionally used by most client software
• v1.0 will be phased out in May 2013
• to be replaced by more restrictive v1.1
Streaming API
• public, user, and site streams
• provides data in real time and largely unprocessed as it flows through the platform
Search API
• same functionality as Twitter search
• rate-limited
1) data: tweets, social graph
2) complex tools needed
3) constraints on how much data can be captured
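Because a streaming interface delivers data largely unprocessed, the collecting script has to separate tweets from everything else that flows past. The following sketch assumes the stream arrives as one JSON document per line, with blank keep-alive lines and non-tweet notices mixed in. That layout is an assumption for illustration, and the stream here is simulated with a list instead of a live connection.

```python
import json


def iter_tweets(lines):
    """Yield only tweet documents from a line-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank keep-alive line
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(message, dict) and "text" in message:
            yield message  # anything without a "text" field is not a tweet


# Simulated stream (invented data): two tweets, a keep-alive,
# a deletion notice, and a broken line.
simulated_stream = [
    '{"id_str": "1", "text": "first"}',
    "",
    '{"delete": {"status": {"id_str": "1"}}}',
    '{"id_str": "2", "text": "second"}',
    "not json",
]

tweets = list(iter_tweets(simulated_stream))
print([t["id_str"] for t in tweets])
```

Writing the filter as a generator means tweets can be stored one at a time as they arrive, without holding the whole stream in memory.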
Legal issues: Twitter's terms of service
"By submitting, posting or displaying Content on or through
the Services, you grant us a worldwide, non-exclusive,
royalty-free license (with the right to sublicense) to use,
copy, reproduce, process, adapt, modify, publish, transmit,
display and distribute such Content in any and all media or
distribution methods (now known or later developed)."

                  "You agree that this license includes the right for Twitter to
                  make such Content available to other companies,
                  organizations or individuals who partner with Twitter for
                  the syndication, broadcast, distribution or publication of
                  such Content on other media and services, subject to our
                  terms and conditions for such Content use."

"We encourage and permit broad re-use of
Content. The Twitter API exists to enable this."
Legal issues: API rules
"You will not attempt or encourage others to: sell, rent,
lease, sublicense, redistribute, or syndicate access to the
Twitter API or Twitter Content to any third party without
prior written approval from Twitter. If you provide an API
that returns Twitter data, you may only return IDs (including
tweet IDs and user IDs). You may export or extract non-programmatic,
GUI-driven Twitter Content as a PDF or
spreadsheet by using "save as" or similar functionality.
Exporting Twitter Content to a datastore as a service or
other cloud based service, however, is not permitted."

                  "Except as permitted through the Services (or these Terms),
                  you have to use the Twitter API if you want to reproduce,
                  modify, create derivative works, distribute, sell, transfer,
                  publicly display, publicly perform, transmit, or otherwise use
                  the Content or Services."
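One practical consequence of the rule that only IDs may be returned is that a data set is reduced to tweet IDs and user IDs before it is shared. A minimal sketch of that reduction follows. The record layout (`id`, `user_id`, `text`) is hypothetical, and whether a particular form of sharing is permitted depends on the terms quoted above, not on this code.

```python
def extract_ids(tweets):
    """Reduce collected tweets to sorted, de-duplicated tweet and user IDs."""
    tweet_ids = sorted({t["id"] for t in tweets}, key=int)
    user_ids = sorted({t["user_id"] for t in tweets}, key=int)
    return {"tweet_ids": tweet_ids, "user_ids": user_ids}


# Invented records; the field names are assumptions for illustration.
collected = [
    {"id": "12", "user_id": "7", "text": "first tweet"},
    {"id": "3", "user_id": "7", "text": "second tweet"},
    {"id": "12", "user_id": "7", "text": "first tweet"},  # duplicate capture
]

shareable = extract_ids(collected)
print(shareable)
```

The tweet text and all other content are dropped; only the identifiers remain in the shareable output.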
Tweet Archivist Desktop
(Windows desktop software)
yourTwapperKeeper
(runs on a dedicated web server)
140kit
(hosted platform for
 academic research)
DataSift/Gnip
(social data resellers)
Sampling approaches
Strategy #1: Sample by hashtag, keyword, user, geographical
location, or other filtering parameters
+ representativeness unclear on multiple levels
- time frame and parameters have to be carefully chosen

Strategy #2: Use the 1% or 10% sample provided by the
Streaming API
+ generally assumed to be representative (of Twitter)
- time frame has to be carefully chosen

Strategy #3: Capture Twitter's entire throughput
+ highly representative (of Twitter)
- technically very difficult/costly
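The difference between the first two strategies can be illustrated on a small synthetic collection. The sketch below filters by hashtag (Strategy #1) and draws a uniform 1% random sample (a local stand-in for Strategy #2). The records are invented, and the random draw is an assumption made for illustration: it does not describe how Twitter itself selects its sample.

```python
import random


def filter_by_hashtag(tweets, hashtag):
    """Strategy #1: keep only tweets carrying the given hashtag."""
    tag = hashtag.lower().lstrip("#")
    return [t for t in tweets if tag in {h.lower() for h in t["hashtags"]}]


def random_sample(tweets, fraction, seed=0):
    """Local analogue of Strategy #2: a uniform random fraction of the tweets."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = round(len(tweets) * fraction)
    return rng.sample(tweets, k)


# Synthetic corpus: every fourth tweet carries the hashtag "phdchat".
corpus = [
    {"id": str(i), "hashtags": ["phdchat"] if i % 4 == 0 else ["other"]}
    for i in range(1000)
]

by_hashtag = filter_by_hashtag(corpus, "#PhDchat")
one_percent = random_sample(corpus, 0.01)
print(len(by_hashtag), len(one_percent))
```

The filtered set says something only about the chosen hashtag, while the random sample reflects the whole collection at a much smaller size, which mirrors the trade-off listed above.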
Summary
1. develop a question/general direction
2. collect data using these or other tools
3. store in a database or spreadsheet (CSV)
4. annotate, analyze and visualize using a variety of tools (Excel, Tableau, R, Gephi, NVIVO, ...)
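The storage and analysis steps can be sketched with Python's standard library alone. The example writes a few invented records to CSV and then reads them back to count tweets per user. The column names are assumptions, and an in-memory buffer stands in for a file on disk so that the example runs anywhere.

```python
import csv
import io
from collections import Counter

FIELDS = ["id", "user", "text"]  # assumed column layout


def write_csv(rows, handle):
    """Store collected tweets as CSV, one row per tweet."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def tweets_per_user(handle):
    """Analyze the stored CSV: count how many tweets each user posted."""
    return Counter(row["user"] for row in csv.DictReader(handle))


# Invented records for illustration.
rows = [
    {"id": "1", "user": "alice", "text": "Reading, writing, #phdchat"},
    {"id": "2", "user": "bob", "text": "Deadline week"},
    {"id": "3", "user": "alice", "text": "Submitted!"},
]

buffer = io.StringIO()  # stands in for open("tweets.csv", "w", newline="")
write_csv(rows, buffer)
buffer.seek(0)
counts = tweets_per_user(buffer)
print(counts.most_common())
```

A CSV file produced this way can be opened directly in tools such as Excel, Tableau, or R for annotation and visualization.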
Questions?




http://www.teachthought.com/wp-content/uploads/2012/11/twitter-logo-hashtag.jpg
