Analyzing Twitter data
Issues, Challenges and Opportunities



RC33 Conference, Sydney Australia,
9-13 July 2012



Maurice Vergeer
m.vergeer@maw.ru.nl / www.mauricevergeer.nl / blog.mauricevergeer.nl
Radboud University Nijmegen, the Netherlands
   Many platforms
    -   Facebook
    -   Twitter
    -   Linkedin
    -   Hyves
    -   RenRen
    -   Cyworld
    -   Orkut
    -   Youtube
    -   Flickr
    -   Plurk
    -   Sina Weibo
    -   Etc

   Empty platform / infrastructure
    -   Facility

   User generated content
    -   Text
    -   Audio
    -   Video
    -   Pictures



Social media
[Figure: Number of articles on politics, Internet and social media, per year, 1995-2012. Vertical axis: number of articles (0-180). Series: Internet and politics (query 1); Social media and politics (query 2); Internet, social media and politics (query 3).]



Source: Vergeer (in press / 2012) in New Media & Society
Focus on Twitter
The Netherlands



  A special case?
   Opportunities
    ◦ Methodological/technical
       Timeseries analysis
       Network analysis
        ◦ Actors
        ◦ Content
        ◦ Diffusion of information through online social networks
        ◦ Social media activities

   Limitations
    ◦ Twitter
       Reliability of Twitter API




Outline
•   Within Twitter (using the API)
    • Username
    • Account creation date
    • # of followers
      • And the actual usernames of these followers
    • # of accounts followed
      • And the actual usernames of those being followed
    • Tweet text

    • And many more (see dev.twitter.com)




Data sources
   Tweet
    ◦ Tweet text

    ◦ Whether or not it was a reply to another tweet
       To whom it was a reply (username/screenname and numerical
        userid)

    ◦ Whether or not it was a retweet (according to Twitter)
       Which tweet was retweeted (numerical tweetid)
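
The user and tweet fields listed above can be pulled out of a single tweet record. The following is a minimal sketch, assuming the JSON layout of Twitter REST API v1.1 payloads (field names such as screen_name, friends_count, in_reply_to_screen_name and retweeted_status). The sender name and all numbers in the sample record are invented for illustration.

```python
import json

# Invented sample record; only the field names follow the assumed v1.1 layout.
SAMPLE_TWEET_JSON = """
{
  "id": 1001,
  "text": "@miken32 @CBCEdmonton probably because that is NOT what I said",
  "in_reply_to_status_id": 1000,
  "in_reply_to_user_id": 42,
  "in_reply_to_screen_name": "miken32",
  "user": {
    "screen_name": "example_mp",
    "created_at": "Mon Mar 02 10:15:00 +0000 2009",
    "followers_count": 2500,
    "friends_count": 300
  }
}
"""


def summarize_tweet(tweet: dict) -> dict:
    """Collect the account and tweet attributes named on the slides."""
    user = tweet.get("user", {})
    retweeted = tweet.get("retweeted_status")  # only present on API-flagged retweets
    return {
        "username": user.get("screen_name"),
        "account_created": user.get("created_at"),
        "followers": user.get("followers_count"),
        "following": user.get("friends_count"),
        "text": tweet.get("text"),
        "is_reply": tweet.get("in_reply_to_status_id") is not None,
        "reply_to_screen_name": tweet.get("in_reply_to_screen_name"),
        "reply_to_user_id": tweet.get("in_reply_to_user_id"),
        "is_retweet": retweeted is not None,
        "retweeted_tweet_id": retweeted.get("id") if retweeted else None,
    }


summary = summarize_tweet(json.loads(SAMPLE_TWEET_JSON))
print(summary["username"], summary["is_reply"], summary["is_retweet"])
```

One flat record per tweet in this shape is easy to export to SPSS or any other data management tool.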
   Message of tweet

   Whether or not it was a directed tweet
    (sent to someone in particular)
    ◦ Identified by an @-sign


   Whether or not it was a retweet
    ◦ Identified by RT




Type of content
   Undirected tweet
    ◦ RCMP Commissioner appearing before Public Safety Cmte now.
      What a popular guy - he has his own paparazzi!

   Directed tweet
    ◦ Fantastic blog by my good friend @GlenPearson -
      http://bit.ly/hlAKXp #lpc

   Directed tweet to two usernames
    ◦ @miken32 @CBCEdmonton probably because that is NOT what I
      said--more commercially viable is different than not needed.

   Retweet
    ◦ RT @liberal_party: Think Durham deserves better than Bev Oda?
      Join @BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham
      #cdnpoli #lpc
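
The three tweet types can be told apart from the tweet text alone. The sketch below is one possible rule set, not necessarily the one used in the study: a standalone "RT" followed by an @-address marks a retweet, any other @-address marks a directed tweet, and everything else is undirected. The function and pattern names are made up for this example.

```python
import re

# Standalone "RT" (any case) followed by an @-address.
RETWEET = re.compile(r"(?<!\w)RT\s+@\w+", re.IGNORECASE)
# @-address not preceded by a word character, which skips e-mail addresses.
MENTION = re.compile(r"(?<!\w)@(\w+)")


def classify_tweet(text: str) -> str:
    """Return 'retweet', 'directed' or 'undirected' for a tweet text."""
    if RETWEET.search(text):
        return "retweet"
    if MENTION.search(text):
        return "directed"
    return "undirected"


def addressees(text: str) -> list:
    """Return every @-address in the text, in order of appearance."""
    return MENTION.findall(text)


examples = [
    "RCMP Commissioner appearing before Public Safety Cmte now. "
    "What a popular guy - he has his own paparazzi!",
    "Fantastic blog by my good friend @GlenPearson - http://bit.ly/hlAKXp #lpc",
    "@miken32 @CBCEdmonton probably because that is NOT what I said",
    "RT @liberal_party: Think Durham deserves better than Bev Oda? "
    "Join @BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham #cdnpoli #lpc",
]
for text in examples:
    print(classify_tweet(text), addressees(text))
```

Checking for the retweet marker first matters, because a retweet always contains at least one @-address and would otherwise be counted as a directed tweet.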




Tweet examples
   Traditional material
    ◦ Produced by professional actors
    ◦ Newspapers
    ◦ Public administration documents

   Social media
    ◦ Produced by
       professional actors
       general public




Content analysis of tweets
   Large quantities of data

   Word frequencies
    ◦ Identifying the most important words in the corpus
    ◦ Code these words into more general categories

   Switch to SPSS (or other type of data management tool)
    ◦ Search for the words in the actual tweets
    ◦ Assign tweet to a specific code

   Improvements in SPSS
    ◦ Compute command facilitates many new text operators
    ◦ Char.index, Char.substr, etc

   Alternative
    ◦ Regular expressions
    ◦ complex
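
The word-frequency and coding steps can also be done with regular expressions outside SPSS. The following is a minimal sketch in Python. The codebook categories and keywords are hypothetical and only serve to show the mechanics.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")
# A token is a word, optionally prefixed by # (hashtag) or @ (username).
TOKEN = re.compile(r"[#@]?\w+")

# Hypothetical codebook: general category -> keywords taken from the frequency list.
CODEBOOK = {
    "campaign": {"rally", "vote", "election"},
    "party": {"#lpc", "#cdnpoli"},
}


def tokenize(text: str) -> list:
    """Lower-case tokens of a tweet, with URLs removed first."""
    return [token.lower() for token in TOKEN.findall(URL.sub(" ", text))]


def word_frequencies(tweets) -> Counter:
    """Count how often each token occurs in the whole corpus."""
    counts = Counter()
    for text in tweets:
        counts.update(tokenize(text))
    return counts


def code_tweet(text: str, codebook: dict) -> list:
    """Return the sorted categories whose keywords occur in the tweet."""
    tokens = set(tokenize(text))
    return sorted(cat for cat, words in codebook.items() if tokens & words)


corpus = [
    "Fantastic blog by my good friend @GlenPearson - http://bit.ly/hlAKXp #lpc",
    "RT @liberal_party: Think Durham deserves better than Bev Oda? "
    "Join @BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham #cdnpoli #lpc",
]
print(word_frequencies(corpus).most_common(3))
print([code_tweet(text, CODEBOOK) for text in corpus])
```

Keeping the codebook as plain data separates the coding decisions from the matching code, so categories can be revised without touching the functions.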




Data extraction
   Publicly available data sources on
    parliament, election council

   Time series
    ◦ Identifying societal/political events
      relevant for the study at hand
      Ex.1 Temporary shutdown of the election campaign
       due to passenger plane crash of Dutch airliner in
       Libya, May 2010
      Ex.2 Deregistration of the People's Political Power
       Party of Canada




External data sources
[Figure: bar chart by media type (newspaper, broadcasting, radio, news agency, magazine, online only, local). Vertical axis: 0-900. Series: institutional Twitter account; personal Twitter account.]
Source: Vergeer & Hermans (forthcoming / 2013)
in Journal of Computer-Mediated Communication
[Figure: daily time series, 1 May 2010 to 9 June 2010. Vertical axis: 0-1000. One series per party: CDA, PvdA, SP, VVD, PVV, GL, CU, D66, PvdD, SGP, NN, TON, MenS, HNL, Partij1, Piraten.]

   Date and time

   For longitudinal analysis and cross-national comparisons
    ◦ take note of the time differences and correct if necessary.
        Time zones
        Daylight saving time

   What to do with countries having multiple time zones?
    ◦ Depends on RQs
       Communication patterns: keep a single time zone
       Focus on individual daily patterns: adjust for time zones
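
Both options can be sketched with the Python standard library. The example assumes the timestamp layout of Twitter REST API v1.1 ("Wed May 12 22:30:00 +0000 2010", always in UTC) and a per-user UTC offset in seconds. The timestamp itself and the second offset are invented. Fixed offsets do not follow daylight saving changes, so a period that spans a clock change needs a named time zone database instead.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout of the created_at field in API v1.1 payloads.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Dutch summer time (CEST) is UTC+2 and covers all of May and June 2010.
CEST_OFFSET_SECONDS = 2 * 3600


def parse_created_at(created_at: str) -> datetime:
    """Parse a created_at string into a time zone aware datetime."""
    return datetime.strptime(created_at, CREATED_AT_FORMAT)


def to_fixed_offset(moment: datetime, offset_seconds: int) -> datetime:
    """Express the same instant in a zone with a fixed UTC offset."""
    return moment.astimezone(timezone(timedelta(seconds=offset_seconds)))


tweet_time = parse_created_at("Wed May 12 22:30:00 +0000 2010")

# Communication patterns: put every tweet on one reference clock.
reference_time = to_fixed_offset(tweet_time, CEST_OFFSET_SECONDS)

# Individual daily patterns: use the sender's own offset (here a made-up UTC-4).
local_time = to_fixed_offset(tweet_time, -4 * 3600)

print(reference_time.isoformat(), local_time.isoformat())
```

Note that the calendar date also shifts: the tweet falls on 12 May in UTC but on 13 May on the Dutch clock, which matters when tweets are counted per day.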
   Total tweets by candidates, followers and followed:
    ◦ 4,536,854 tweets

   Breakdown
    ◦ Tweets among candidates: appr 2%
    ◦ Tweets to inner circles (followers or being followed): appr 18%
    ◦ Tweets to outer circle: appr 33%
    ◦ Tweets not directed to anyone in particular: appr 49%

    ◦ Extracting users from tweets (@-addresses)




Communication network analysis
 Communication network based on
  candidates identified in tweets
 Excluding the general public
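
A candidate-only communication network can be built as a weighted edge list: one directed edge from sender to each candidate addressed in a tweet. The sketch below assumes tweets are available as (sender, text) pairs and that a list of candidate usernames exists. All usernames and tweets in the example are hypothetical.

```python
import re
from collections import Counter

# @-address not preceded by a word character, which skips e-mail addresses.
MENTION = re.compile(r"(?<!\w)@(\w+)")


def candidate_edges(tweets, candidates) -> Counter:
    """Count directed sender -> addressee ties among candidates only.

    tweets: iterable of (sender, text) pairs.
    candidates: set of lower-case candidate usernames.
    """
    edges = Counter()
    for sender, text in tweets:
        source = sender.lower()
        if source not in candidates:
            continue  # sender belongs to the general public
        for name in MENTION.findall(text):
            target = name.lower()
            if target in candidates and target != source:
                edges[(source, target)] += 1
    return edges


# Hypothetical data for illustration.
candidates = {"candidate_a", "candidate_b"}
tweets = [
    ("candidate_a", "@candidate_b thanks for the debate"),
    ("candidate_a", "Agree with @Candidate_B and @some_voter"),
    ("candidate_b", "RT @candidate_a: see you tomorrow"),
    ("some_voter", "@candidate_a good luck"),
]
edges = candidate_edges(tweets, candidates)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

The resulting (source, target, weight) triples can be imported directly into most network analysis packages.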




Communication network analysis
   See http://tinyurl.com/blzajsl for
    animated version.
   Retrospective
    ◦ The API goes back at most 3,200 tweets per account

   Cost / technical
    ◦ Access to the firehose for real-time data




Limitations in data collection
   Date of tweet
    ◦ A very small fraction of tweets is time stamped with the wrong date
   Solution
    ◦ Estimate date and time using the tweetid

   Status of tweet as retweet
    ◦ RT
   Solution:
    ◦ Use text search operators to identify real retweets ("RT ", "rt ")
    ◦ Also see http://tinyurl.com/bohhjzn

   Reply to tweets
    ◦ Only the first address is identified
   Solution
    ◦ Search for multiple @-addresses using text extraction methods
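
For the first solution, estimating date and time from the tweetid, one possible approach is to decode the ID itself. The sketch below assumes Snowflake-style tweet IDs, which Twitter introduced in November 2010: the bits above the lowest 22 hold the milliseconds elapsed since the Snowflake epoch. Tweets from before that date, including the May and June 2010 campaign period, have plain sequential IDs and would need another approach, for example interpolating between IDs with trusted timestamps. This is not necessarily the method used in the study.

```python
from datetime import datetime, timedelta, timezone

# Snowflake epoch in milliseconds since 1970: 2010-11-04 01:42:54.657 UTC.
TWITTER_EPOCH_MS = 1288834974657
# The lowest 22 bits of an ID hold machine and sequence numbers.
TIMESTAMP_SHIFT = 22

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tweet_id_to_ms(tweet_id: int) -> int:
    """Milliseconds since 1970 encoded in a Snowflake-style tweet ID."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS


def tweet_id_to_datetime(tweet_id: int) -> datetime:
    """Creation time in UTC encoded in a Snowflake-style tweet ID."""
    return UNIX_EPOCH + timedelta(milliseconds=tweet_id_to_ms(tweet_id))


# Synthetic ID built from a known instant, so the round trip can be checked.
known_ms = 1656432460105
example_id = (known_ms - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT
print(example_id, tweet_id_to_datetime(example_id).isoformat())
```

Comparing the decoded time with the time stamp delivered by the API is a quick way to flag records with a wrong date.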



Reliability of data as provided by
the API
BIG DATA

The buzzword of these days
 Not gigabytes, not terabytes,
 But petabytes and exabytes of data
 Only for the few
 Specific hardware requirements
    ◦ Computing power
    ◦ Data storage
   The data presented in this presentation
    ◦ Appr 4.5 million records equal appr 1
      gigabyte: not that big
There is still so much to be done
with…
•   Focus on specific cases
    ◦ Political communication
         politicians, candidates in elections
    ◦ Fan studies
         celebrities
         cast of popular soap operas
    ◦ Journalism studies
         journalists and newspapers





Focus on specific cases
 actor information
 information on societal events
 accumulate data over time using the
  same data structure
    ◦ Prolonged analysis
    ◦ Multiple case studies, cross-national
      comparative analysis




Enrich existing Twitter data with
external data
   Traditional process (textbook approach)
    ◦ RQ -> research design

   Practice, particularly with secondary (i.e. third party) data
    ◦ Data -> RQ -> research design
    ◦ Data -> research design -> RQ

Twitter
    Content analysis
    Longitudinal analysis
    Network analysis

   Different research designs require different techniques
   Collaborate



Look at the data from different
angles, i.e. research designs
Thank you for your attention

Social media presentation held at RC33 conference, Sydney, Australia
