
collecting twitter data w/social feed manager

May. 31, 2013

  1. collecting twitter data w/social feed manager Daniel Chudnov - @dchud - dchud at gwu edu ELAG 2013 - 2013-05-30 - Ghent, Belgium tinyurl.com/dchud-elag-2013
  2. social-feed-manager •python/django •user timelines, filter, sample, search •simple display/export for user timelines •free software, on github
  3. social feed manager github.com/gwu-libraries/social-feed-manager
  4. github.com/gwu-libraries/social-feed-manager
  5. a traditional project
  6. 1 expand scope of collection development
  7. 2 at-risk e-resource licensing story
  8. 3 save the time of the researcher
  9. let’s start with the researcher
  10. “How Mainstream News Outlets Use Twitter” (2011) • GWU’s Kimberly Gross (SMPA) + students • Pew Research Center’s Project for Excellence in Journalism • “news agenda these organizations promoted on Twitter closely matches that of their legacy platforms” http://www.journalism.org/analysis_report/ how_mainstream_media_outlets_use_twitter
  11. how do researchers study social media?
  12. by hand.
  13. •google reader •copy and paste •fold, spindle, mutilate •excel •...eventually, SPSS and similar tools
  14. whatever help they can get
  15. it’s a lot of work for not a lot of data
  16. (1000s of tweets)
  17. copy and paste to excel doesn’t scale: just ask any student assigned to do this!
  18. first tweet, in native JSON
  19. a strategic disadvantage
  20. 5,000+ theses/dissertations since 2010 (not all CS grad students)
  21. see Leetaru et al., First Monday, May 2013
  22. librarians can help here
  23. what researchers ask for •specific users, keywords •historic time periods •basic values: user, date, text, counts •10,000s, not 10,000,000s •delimited files to import
  24. options for historical data?
  25. Twitter-licensed data providers: DataSift Gnip Topsy
  26. data providers •friendly •not cheap •more than we need •expensive •still need tools to collect, process, etc.
  27. what can we do ourselves?
  28. social feed manager github.com/gwu-libraries/social-feed-manager
  29. what researchers ask for •specific users, keywords •historic time periods •basic values: user, date, text, counts •10,000s, not 10,000,000s •delimited files to import
  30. can do this free w/public API
  31. twitter api •user timelines •filter streams •spritzer •search
  32. up to 3,200 most recent tweets for any public user, 200 at a time; go back again for more later
  33. dev.twitter.com/docs/working-with-timelines
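The max_id paging pattern from Twitter's "working with timelines" docs can be sketched roughly as follows: request a page of up to 200 tweets, then ask again for tweets strictly older than the lowest id seen, until the API returns nothing or the ~3,200-tweet ceiling is reached. `fetch_page` here is a hypothetical stand-in for an authenticated call to `statuses/user_timeline`, so this runs offline against fake data.

```python
# Sketch of max_id paging for a user timeline. fetch_page simulates
# GET statuses/user_timeline (newest first); a real client would make
# an authenticated HTTP request instead.

def fetch_page(screen_name, max_id=None, count=200):
    """Stand-in for the API: 1,000 fake tweets, newest first."""
    tweets = [{"id": i, "text": "tweet %d" % i} for i in range(1000, 0, -1)]
    if max_id is not None:
        tweets = [t for t in tweets if t["id"] <= max_id]
    return tweets[:count]

def backfill_timeline(screen_name, limit=3200):
    """Page backwards until nothing comes back (or we hit the
    ~3,200-tweet ceiling the API enforces per user timeline)."""
    collected, max_id = [], None
    while len(collected) < limit:
        page = fetch_page(screen_name, max_id=max_id)
        if not page:
            break
        collected.extend(page)
        max_id = page[-1]["id"] - 1  # ask only for strictly older tweets
    return collected

tweets = backfill_timeline("example_user")
```

Subtracting 1 from the lowest id seen avoids re-fetching the boundary tweet on the next page, which is exactly the refinement the linked docs describe.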
  34. 1,969,760 tweets from 1,228 users
  35. group users in sets; export by user/set, all at once or in time slices
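The export researchers ask for is just the basic values as a delimited file they can pull into Excel or SPSS. A minimal sketch, assuming tweet dicts shaped like Twitter's JSON; the sample row and function name are illustrative, not SFM's actual code:

```python
# Flatten tweets to the four basic values researchers ask for and
# write them as CSV.
import csv
import io

def export_tweets_csv(tweets, out):
    """Write user, date, text, and retweet count as delimited rows."""
    writer = csv.writer(out)
    writer.writerow(["screen_name", "created_at", "text", "retweet_count"])
    for t in tweets:
        writer.writerow([t["user"]["screen_name"], t["created_at"],
                         t["text"], t["retweet_count"]])

sample = [
    {"user": {"screen_name": "example_user"},
     "created_at": "Thu May 30 09:00:00 +0000 2013",
     "text": "collecting twitter data at #elag2013",
     "retweet_count": 3},
]

buf = io.StringIO()
export_tweets_csv(sample, buf)
print(buf.getvalue())
```

The csv module handles quoting of commas and newlines inside tweet text, which is exactly what breaks hand-rolled copy-and-paste workflows.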
  36. 40+ media outlets 400+ elected officials 300+ journalists 300+ GWU groups
  37. filter streams
  38. millions of tweets as they occur around an event
  39. filter streams •filter by users, keywords, geo •about 3,000 tweets/min* •10,000,000s of tweets •political debates, news events * a little more complicated than that
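Conceptually, a filter stream is a continuous sequence of tweets from which only those matching tracked keywords or followed users are kept. The real thing is Twitter's streaming API (`statuses/filter` with its `track`/`follow` parameters); this generator over a fake in-memory stream is only a sketch of that shape:

```python
# Minimal sketch of filter-stream semantics: keep a tweet if its text
# mentions a tracked term or its author is in the follow list.

def filter_stream(stream, track=(), follow=()):
    """Yield matching tweets, mirroring the API's track/follow idea."""
    track = [term.lower() for term in track]
    for tweet in stream:
        text = tweet["text"].lower()
        if tweet["user"]["screen_name"] in follow or any(
                term in text for term in track):
            yield tweet

incoming = [
    {"user": {"screen_name": "a"}, "text": "watching the debate tonight"},
    {"user": {"screen_name": "b"}, "text": "lunch was good"},
    {"user": {"screen_name": "c"}, "text": "Debate highlights thread"},
]
kept = list(filter_stream(incoming, track=["debate"]))
```

At ~3,000 tweets/min during a big event, the "a little more complicated than that" footnote is earned: a real collector also has to keep up with the connection, handle rate notices, and write matches out fast enough not to fall behind.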
  40. spritzer feed •~0.5% of all public tweets •~3,000,000 tweets/day (growing) •a useful random sampling
  41. search •after an event •find users, keywords •limited - better than nothing
  42. we can do all this at no marginal cost for data* * not really “big data” - GBs, not TBs
  43. this much alone meets several needs
  44. this much alone shows at-risk nature
  45. when the Pope resigned
  46. when Congress turned over •16+ accounts deleted/hidden •combined 105,993 followers •14,479 tweets saved in SFM that are no longer public
  47. if a researcher needs more •support selection, acquisition, accession, storage, transformation •collect what’s free around it to minimize cost •plan purchase via grant •collect prospectively
  48. next steps
  49. improving sfm •support concurrent per-user filters/streams •add Sina Weibo, YouTube, others as asked
  50. drive selective, automated web archiving
  51. ensure you can use sfm you can have it! it’s free to use, copy, modify, redistribute
  52. discovery ?
  53. the obvious solution
  54. 653 - subject added entry, uncontrolled for hashtags
  55. 700 - name added entries for mentions
  56. 856 42 - URL of related resource for included links
  57. 500 - note for retweet count
  58. 336, 337, 338 - RDA ready!
  59. w/catmandu slinging data around is fun and easy! already indexed piles of tweets in ElasticSearch* * really!
  60. we will add 2 - 4 million catalog records per month
  61. WorldCat can handle this it’s web scale!
  62. augmenting/creating authority records w/twitter screen names already cleared it with a PCC/NACO rep!
  63. Summon can handle this Andrew is very familiar with growing consortial catalogs!
  64. github.com/gwu-libraries/social-feed-manager @dchud - dchud at gwu edu