Social media presentation held at RC33 conference, Sydney, Australia

  1. Analyzing Twitter data: Issues, Challenges and Opportunities. RC33 Conference, Sydney, Australia, 9-13 July 2012. Maurice Vergeer, m.vergeer@maw.ru.nl / www.mauricevergeer.nl / blog.mauricevergeer.nl, Radboud University Nijmegen, the Netherlands
  2. Social media
     • Many platforms: Facebook, Twitter, LinkedIn, Hyves, RenRen, Cyworld, Orkut, YouTube, Flickr, Plurk, Sina Weibo, etc.
     • Empty platform / infrastructure: facility
     • User-generated content: text, audio, video, pictures
  3. Number of articles on politics, Internet and social media
     [Figure: number of articles per year, 1995-2012, for three queries: Internet and politics (query 1); Social media and politics (query 2); Internet, social media and politics (query 3)]
     Source: Vergeer (in press / 2012) in New Media & Society
  4. Focus on Twitter
  5. The Netherlands: a special case?
  6. Outline
     • Opportunities
       ◦ Methodological/technical: time series analysis, network analysis
       ◦ Actors
       ◦ Content
       ◦ Diffusion of information through online social networks
       ◦ Social media activities
     • Limitations
       ◦ Twitter: reliability of the Twitter API
  7. Data sources
     • Within Twitter (using the API)
       ◦ Username
       ◦ Account creation date
       ◦ Number of followers, and the actual usernames of these followers
       ◦ Number of accounts followed, and the actual usernames of those being followed
       ◦ Tweet text
       ◦ And many more (see dev.twitter.com)
  8. Tweet
     ◦ Tweet text
     ◦ Whether or not it was a reply to another tweet: to whom it was a reply (username/screen name and numerical user ID)
     ◦ Whether or not it was a retweet (according to Twitter): which tweet was retweeted (numerical tweet ID)
  9. Type of content
     • Message of the tweet
     • Whether or not it was a directed tweet (sent to someone in particular)
       ◦ Identified by an @-sign
     • Whether or not it was a retweet
       ◦ Identified by RT
  10. Tweet examples
     • Undirected tweet
       ◦ RCMP Commissioner appearing before Public Safety Cmte now. What a popular guy - he has his own paparazzi!
     • Directed tweet
       ◦ Fantastic blog by my good friend @GlenPearson - http://bit.ly/hlAKXp #lpc
     • Directed tweet to two usernames
       ◦ @miken32 @CBCEdmonton probably because that is NOT what I said--more commercially viable is different than not needed.
     • Retweet
       ◦ RT @liberal_party: Think Durham deserves better than Bev Oda? Join @BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham #cdnpoli #lpc
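The three tweet types in the examples above can be told apart with simple text matching. The sketch below is only an illustration of that idea, not the author's procedure: the function name `classify_tweet` and both regular expressions are assumptions, and the patterns are deliberately crude (an e-mail address in a tweet would, for instance, be counted as a mention).

```python
import re

# Assumed patterns: a mention is "@" followed by word characters;
# a retweet starts with "RT @name" (case-insensitive).
MENTION_RE = re.compile(r"@(\w+)")
RETWEET_RE = re.compile(r"^\s*RT\s+@\w+", re.IGNORECASE)


def classify_tweet(text):
    """Return (kind, mentions), where kind is 'retweet', 'directed' or 'undirected'."""
    mentions = MENTION_RE.findall(text)
    if RETWEET_RE.match(text):
        kind = "retweet"
    elif mentions:
        kind = "directed"
    else:
        kind = "undirected"
    return kind, mentions
```

Because all @-addresses are returned, a tweet directed at two usernames yields both of them rather than only the first.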
  11. Content analysis of tweets
     • Traditional material
       ◦ Produced by professional actors
       ◦ Newspapers
       ◦ Public administration documents
     • Social media
       ◦ Produced by professional actors and the general public
  12. Data extraction
     • Large quantities of data
     • Word frequencies
       ◦ Identify the most important words in the corpus
       ◦ Code these words into more general categories
     • Switch to SPSS (or another data management tool)
       ◦ Search for the words in the actual tweets
       ◦ Assign each tweet to a specific code
     • Improvements in SPSS
       ◦ The COMPUTE command supports many new text operators
       ◦ CHAR.INDEX, CHAR.SUBSTR, etc.
     • Alternative
       ◦ Regular expressions (complex)
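The word-frequency and coding steps listed above can also be sketched outside SPSS. The following is a minimal illustration in Python, not the workflow used in the presentation: the tokenizer, the function names and the `CODEBOOK` categories are all hypothetical.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, digits, "#", "@", "_" and "'".
TOKEN_RE = re.compile(r"[a-z0-9#@_']+")

# Hypothetical codebook mapping a general category to the words that signal it.
CODEBOOK = {
    "campaign": {"rally", "vote", "election"},
    "media": {"blog", "interview"},
}


def word_frequencies(tweets):
    """Count how often each token occurs across all tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def code_tweet(text, codebook):
    """Return the sorted list of categories whose words occur in the tweet."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    return sorted(cat for cat, words in codebook.items() if tokens & words)
```

In practice, `word_frequencies(...).most_common(50)` would suggest candidate words for the codebook, which is then applied to each tweet with `code_tweet`.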
  13. External data sources
     • Publicly available data sources on parliament, election council
     • Time series
       ◦ Identify societal/political events relevant to the study at hand
       ◦ Ex. 1: temporary shutdown of the election campaign due to the crash of a Dutch passenger airliner in Libya, May 2010
       ◦ Ex. 2: deregistration of the People's Political Power Party of Canada
  14. [Figure: bar chart, vertical axis 0-900; labels: newspaper, broadcasting, radio, news agency, magazine, online only, local; institutional Twitter account; personal Twitter account]
  15. Source: Vergeer & Hermans (forthcoming / 2013) in Journal of Computer-Mediated Communication
  16. [Figure: daily time series, 1 May 2010 to 9 June 2010, vertical axis 0-1000, one series per party: CDA, PvdD, SGP, PvdA, SP, NN, VVD, TON, PVV, MenS, GL, HNL, CU, Partij1, D66, Piraten]
  17. Date and time
     • For longitudinal analysis and cross-national comparisons
       ◦ Take note of time differences and correct if necessary: time zones, daylight saving time
     • What to do with countries that have multiple time zones?
       ◦ Depends on the RQs
       ◦ Communication patterns: keep a single time zone
       ◦ Focus on individual daily patterns: adjust for time zones
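As a hedged illustration of the time zone correction, the sketch below converts timestamps to local time before counting tweets per day. It assumes timestamps in the `created_at` layout of the Twitter REST API (for example `Tue Jun 08 22:30:00 +0000 2010`), an English locale for parsing day and month names, and Python 3.9 or later; the helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Assumed timestamp layout, e.g. "Tue Jun 08 22:30:00 +0000 2010".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def local_zone(name, fallback_hours):
    """Return a DST-aware IANA zone, or a fixed offset if no tz database is installed."""
    try:
        return ZoneInfo(name)
    except ZoneInfoNotFoundError:
        return timezone(timedelta(hours=fallback_hours))


def to_local(created_at, zone):
    """Parse a timestamp that carries a UTC offset and express it in the given zone."""
    return datetime.strptime(created_at, TWITTER_FORMAT).astimezone(zone)


def tweets_per_local_day(timestamps, zone):
    """Count tweets per calendar day as experienced in the given zone."""
    return Counter(to_local(ts, zone).date().isoformat() for ts in timestamps)
```

A tweet sent at 22:30 UTC on 8 June 2010 falls on 9 June in Amsterdam, so daily counts can shift depending on the zone chosen. For a study of communication patterns all tweets would be converted to one zone; for individual daily patterns each account would get its own zone.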
  18. Communication network analysis
     • Total tweets by candidates, followers and followed: 4,536,854 tweets
     • Breakdown
       ◦ Tweets among candidates: approx. 2%
       ◦ Tweets to inner circles (followers or being followed): approx. 18%
       ◦ Tweets to outer circle: approx. 33%
       ◦ Tweets not directed to anyone in particular: approx. 49%
     • Extracting users from tweets (@-addresses)
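Extracting @-addresses turns tweets into a directed network of who addresses whom. A minimal sketch under stated assumptions follows: tweets arrive as `(sender, text)` pairs, usernames are compared case-insensitively, and the function names are hypothetical rather than taken from the presentation.

```python
import re
from collections import Counter

# Assumed mention pattern: "@" followed by word characters.
MENTION_RE = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Build weighted directed edges (sender, addressee) from (sender, text) pairs."""
    edges = Counter()
    for sender, text in tweets:
        for target in MENTION_RE.findall(text):
            edges[(sender.lower(), target.lower())] += 1
    return edges


def restrict_to(edges, actors):
    """Keep only edges whose sender and addressee both belong to the given actors."""
    keep = {name.lower() for name in actors}
    return {pair: n for pair, n in edges.items()
            if pair[0] in keep and pair[1] in keep}
```

Passing the list of candidates to `restrict_to` gives the network among candidates only, leaving out the general public.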
  19. Communication network analysis: communication network based on candidates identified in tweets, excluding the general public
  20. See http://tinyurl.com/blzajsl for an animated version.
  21. Limitations in data collection
     • Retrospective
       ◦ 3200 tweets back in time
     • Cost, technical
       ◦ Access to the firehose for real-time data
  22. Reliability of data as provided by the API
     • Date of tweet
       ◦ A minute fraction of tweets is time-stamped with the wrong date
       ◦ Solution: estimate date and time using the tweet ID
     • Status of tweet as retweet
       ◦ RT
       ◦ Solution: use text search operators to identify real retweets ("RT ", "rt ")
       ◦ Also see http://tinyurl.com/bohhjzn
     • Reply to tweets
       ◦ Only the first address is identified
       ◦ Solution: search for multiple @-addresses using text extraction methods
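The slide does not say how date and time are estimated from the tweet ID. One possibility, shown here purely as an assumption, applies to IDs generated by Twitter's Snowflake scheme (in use since late 2010), where the bits above the lowest 22 hold the number of milliseconds since a fixed epoch. Older IDs, such as those from a spring 2010 campaign, do not follow this layout and would need a different approach, for example interpolating between tweets whose timestamps are trusted.

```python
from datetime import datetime, timedelta, timezone

# Snowflake epoch in milliseconds since 1970-01-01 UTC (4 November 2010).
TWITTER_EPOCH_MS = 1288834974657
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tweet_id_to_datetime(tweet_id):
    """Decode the creation time (UTC) embedded in a Snowflake-style tweet ID."""
    if tweet_id < 0:
        raise ValueError("tweet_id must be non-negative")
    # Drop the 22 low bits (worker and sequence numbers), keep the timestamp.
    ms_since_unix_epoch = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return UNIX_EPOCH + timedelta(milliseconds=ms_since_unix_epoch)
```

Comparing the decoded time with the time stamp returned by the API is one way to flag the small number of tweets carrying a wrong date.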
  23. BIG DATA: the buzzword of these days
  24. Not gigabytes, not terabytes, but petabytes and exabytes of data
  25. Only for the few
     • Specific hardware requirements
       ◦ Computing power
       ◦ Data storage
     • The data presented in this presentation
       ◦ Approx. 4.5 million records equal approx. 1 gigabyte: not that big
  26. There is still so much to be done with…
  27. Focus on specific cases
     • Political communication: politicians, candidates in elections
     • Fan studies: celebrities, cast of popular soap operas
     • Journalism studies: journalists and newspapers
  28. Enrich existing Twitter data with external data
     • Actor information
     • Information on societal events
     • Accumulate data over time using the same data structure
       ◦ Prolonged analysis
       ◦ Multiple case studies, cross-national comparative analysis
  29. Look at the data from different angles, i.e. research designs
     • Traditional process (textbook approach)
       ◦ RQ -> research design
     • Practice, particularly with secondary (i.e. third-party) data
       ◦ Data -> RQ -> research design
       ◦ Data -> research design -> RQ
     • Twitter: content analysis, longitudinal analysis, network analysis
     • Different research designs require different techniques
     • Collaborate
  30. Thank you for your attention
