
collecting twitter data w/social feed manager


a talk given at ELAG 2013 in Ghent, Belgium, on May 30, 2013.

Published in: Technology

Transcript of "collecting twitter data w/social feed manager"

  1. collecting twitter data w/ social feed manager
     Daniel Chudnov - @dchud - dchud at gwu edu
     ELAG 2013 - 2013-05-30 - Ghent, Belgium
     tinyurl.com/dchud-elag-2013
  2. social-feed-manager
     • python / django
     • user timelines, filter, sample, search
     • simple display / export for user timelines
     • free software, on github
  3. social feed manager
     github.com/gwu-libraries/social-feed-manager
  4. github.com/gwu-libraries/social-feed-manager
  5. a traditional project
  6. 1: expand scope of collection development
  7. 2: at-risk e-resource licensing story
  8. 3: save the time of the researcher
  9. let’s start with the researcher
  10. “How Mainstream News Outlets Use Twitter” (2011)
     • GWU’s Kimberly Gross (SMPA) + students
     • Pew Research Center’s Project for Excellence in Journalism
     • “news agenda these organizations promoted on Twitter closely matches that of their legacy platforms”
     http://www.journalism.org/analysis_report/how_mainstream_media_outlets_use_twitter
  11. how do researchers study social media?
  12. by hand.
  13. • google reader
     • copy and paste
     • fold, spindle, mutilate
     • excel
     • ...eventually, SPSS and similar tools
  14. whatever help they can get
  15. it’s a lot of work for not a lot of data
  16. (1000s of tweets)
  17. copy and paste to excel doesn’t scale
     just ask any student assigned to do this!
  18. first tweet, in native JSON
  19. a strategic disadvantage
  20. 5,000+ theses/dissertations since 2010 (not all CS grad students)
  21. see Leetaru et al., May 2013, First Monday
  22. librarians can help here
  23. what researchers ask for
     • specific users, keywords
     • historic time periods
     • basic values: user, date, text, counts
     • 10000s, not 10000000s
     • delimited files to import
  24. options for historical data?
  25. Twitter-licensed data providers:
     • DataSift
     • Gnip
     • Topsy
  26. data providers
     • friendly
     • not cheap
     • more than we need
     • expensive
     • still need tools to collect, process, etc.
  27. what can we do ourselves?
  28. social feed manager
     github.com/gwu-libraries/social-feed-manager
  29. what researchers ask for
     • specific users, keywords
     • historic time periods
     • basic values: user, date, text, counts
     • 10000s, not 10000000s
     • delimited files to import
  30. can do this free w/ public API
  31. twitter api
     • user timelines
     • filter streams
     • spritzer
     • search
  32. up to 3,200 most recent tweets
     any public user
     200 at a time
     and go back again for more later
  33. dev.twitter.com/docs/working-with-timelines
  34. 1,969,760 tweets from 1,228 users
  35. group users in sets
     export by user / set
     all at once or time slices
  36. • 40+ media outlets
     • 400+ elected officials
     • 300+ journalists
     • 300+ GWU groups
  37. filter streams
  38. millions of tweets as they occur around an event
  39. filter streams
     • filter by users, keywords, geo
     • about 3,000 tweets / min *
     • 10,000,000s of tweets
     • political debates, news events
     * a little more complicated than that
  40. spritzer feed
     • ~0.5% of all public tweets
     • ~3,000,000 tweets / day (growing)
     • a useful random sampling
  41. search
     • after an event
     • find users, keywords
     • limited - better than nothing
  42. we can do all this at no marginal cost for data*
     * not really “big data” - GBs, not TBs
  43. this much alone meets several needs
  44. this much alone shows at-risk nature
  45. when the Pope resigned
  46. when Congress turned over
     • 16+ accounts deleted / hidden
     • combined 105,993 followers
     • 14,479 tweets saved in SFM no longer public
  47. if a researcher needs more
     • support selection, acquisition, accession, storage, transformation
     • collect what’s free around it to minimize cost
     • plan purchase via grant
     • collect prospectively
  48. next steps
  49. improving sfm
     • support concurrent per-user filters / streams
     • add Sina Weibo, YouTube, others as asked
  50. drive selective, automated web archiving
  51. ensure you can use sfm
     you can have it! it’s free to use, copify, modify, redistribute
  52. discovery?
  53. the obvious solution
  54. 653 - subject added entry, uncontrolled for hashtags
  55. 700 - name added entries for mentions
  56. 856 42 - URL of related resource for included links
  57. 500 - note for retweet count
  58. 336, 337, 338 - RDA ready!
  59. w/ catmandu slinging data around is fun and easy!
     already indexed piles of tweets in ElasticSearch*
     * really!
  60. we will add 2 - 4 million catalog records per month
  61. WorldCat can handle this
     it’s web scale!
  62. augmenting / creating authority records w/ twitter screen names
     already cleared it with a PCC / NACO rep!
  63. Summon can handle this
     Andrew is very familiar with growing consortial catalogs!
  64. github.com/gwu-libraries/social-feed-manager
     @dchud
     dchud @ gwu edu
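
The user-timeline slides describe collecting up to 3,200 of a public user's most recent tweets, 200 at a time, and the "what researchers ask for" slides describe delimited files holding user, date, text, and counts. The sketch below is one possible way to express that loop in Python, not the code social-feed-manager actually uses. It assumes a hypothetical `fetch_page` callable standing in for the real network request, and it assumes tweet dictionaries with the `id`, `user.screen_name`, `created_at`, `text`, and `retweet_count` fields found in Twitter's JSON. The `make_fake_fetcher` helper is invented purely so the example runs without credentials.

```python
import csv
import io

PAGE_SIZE = 200     # tweets per request, per the slides
MAX_TWEETS = 3200   # depth of a user timeline, per the slides


def fetch_timeline(fetch_page, screen_name):
    """Walk backwards through a user timeline one page at a time.

    fetch_page(screen_name, count, max_id) is a hypothetical callable that
    returns a list of tweet dicts, newest first, with ids <= max_id
    (or the newest tweets when max_id is None).
    """
    tweets = []
    max_id = None
    while len(tweets) < MAX_TWEETS:
        page = fetch_page(screen_name, PAGE_SIZE, max_id)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # Next request: only tweets strictly older than the oldest one seen.
        max_id = min(t["id"] for t in page) - 1
    return tweets[:MAX_TWEETS]


def to_delimited(tweets, out, delimiter="\t"):
    """Write the basic values (user, date, text, count) as a delimited file."""
    writer = csv.writer(out, delimiter=delimiter)
    writer.writerow(["user", "date", "text", "retweet_count"])
    for t in tweets:
        writer.writerow([
            t["user"]["screen_name"],
            t["created_at"],
            t["text"],
            t["retweet_count"],
        ])


def make_fake_fetcher(total):
    """Illustrative stand-in for the network call: serves synthetic tweets."""
    archive = [
        {
            "id": i,
            "user": {"screen_name": "example"},
            "created_at": "2013-05-30",
            "text": "tweet %d" % i,
            "retweet_count": 0,
        }
        for i in range(total, 0, -1)  # newest (highest id) first
    ]

    def fetch_page(screen_name, count, max_id):
        older = [t for t in archive if max_id is None or t["id"] <= max_id]
        return older[:count]

    return fetch_page


if __name__ == "__main__":
    collected = fetch_timeline(make_fake_fetcher(450), "example")
    buffer = io.StringIO()
    to_delimited(collected, buffer)
    print(len(collected), "tweets collected")
```

Stepping `max_id` down to one below the oldest id seen avoids fetching the same tweet twice across pages. "Going back again for more later" would mean rerunning the walk and keeping only ids newer than those already stored, which this sketch leaves out.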